Original research
Patient Reported Outcome Measures for Rheumatoid Arthritis Disease Activity: a systematic review following COSMIN guidelines
  1. Tim Pickles1,
  2. Rhiannon Macefield2,
  3. Olalekan Lee Aiyegbusi3,4,5,6,7,
  4. Claire Beecher8,9,10,
  5. Mike Horton11,
  6. Karl Bang Christensen12,
  7. Rhiannon Phillips13,
  8. David Gillespie1 and
  9. Ernest Choy14
  1. 1Centre for Trials Research, Cardiff University, Cardiff, UK
  2. 2Bristol Centre for Surgical Research, Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
  3. 3Institute of Applied Health Research, University of Birmingham, Birmingham, UK
  4. 4Centre for Patient Reported Outcomes Research, Institute of Applied Health Research, University of Birmingham, Birmingham, UK
  5. 5NIHR Applied Research Collaboration West Midlands, Birmingham, UK
  6. 6Birmingham Health Partners Centre for Regulatory Science and Innovation, University of Birmingham, Birmingham, UK
  7. 7NIHR Birmingham Biomedical Research Centre, NIHR Surgical Reconstruction and Microbiology Research Centre, University of Birmingham, Birmingham, UK
  8. 8School of Nursing and Midwifery, National University of Ireland Galway, Galway, Ireland
  9. 9Evidence Synthesis Ireland and Cochrane Ireland, National University of Ireland Galway, Galway, Ireland
  10. 10Health Research Board - Trials Methodology Research Network, National University of Ireland, Galway, Ireland
  11. 11Psychometric Laboratory for Health Sciences, University of Leeds, Leeds, UK
  12. 12Section of Biostatistics, University of Copenhagen, Copenhagen, Denmark
  13. 13Cardiff School of Sport and Health Sciences, Cardiff Metropolitan University, Cardiff, UK
  14. 14Department of Infection and Immunity, Cardiff University, Cardiff, UK
  1. Correspondence to Tim Pickles; PicklesTE{at}cardiff.ac.uk

Abstract

Background The current standard of care in rheumatoid arthritis (RA) requires regular assessment of disease activity (DA). All standard RA DA measurement instruments require joint counts to be undertaken by a healthcare professional with/without a blood test. Few healthcare providers have the capacity to assess patients as frequently as stipulated by guidelines. Patient Reported Outcome Measures (PROMs) could be an efficient and informative way to assess RA DA, which is highlighted by the SARS-CoV-2 pandemic, as most consultations are remote rather than face-to-face. We aimed to assess all PROMs for RA DA against the internationally recognised COSMIN guidelines to provide evidence‐based recommendations to select the most suitable PROMs.

Methods Review registered on PROSPERO as CRD42020176176. The search strategy was based on a previous similar systematic review and expanded to include all articles up to January 2019. All identified articles were rated by two independent assessors following the COSMIN guidelines.

Results 668 abstracts were identified, with 10 articles included. A further 21 were identified from a previous review. Ten PROMs were identified. There was insufficient evidence to place any of the identified PROMs into recommendation for use category A due to lack of evidence for content validity, as stipulated by the COSMIN guidelines.

Conclusion Lack of evidence of content validity limits suitable PROM selection, therefore none can be recommended for use. It is acknowledged that all included PROMs were developed before the COSMIN guidelines were published. Future research on PROMs for RA DA must provide evidence of content validity.

  • rheumatoid arthritis
  • patient reported outcome measures
  • health services research

Data availability statement

Data sharing not applicable as no datasets generated and/or analysed for this study.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Key messages

What is already known about this subject?

  • Ten Patient Reported Outcome Measures (PROMs) have been developed to measure rheumatoid arthritis (RA) disease activity (DA). Previous reviews have suggested the use of RADAI, RADAI5, PAS-II and RAPID3.

What does this study add?

  • This is the first systematic review of PROMs for RA DA that follows the recent COSMIN guidelines.

  • There was insufficient evidence to recommend any of the identified PROMs for use due to lack of evidence for content validity.

How might this impact on clinical practice or further developments?

  • Care should be taken when making use, or interpreting the results, of any of the PROMs for RA DA identified in this review.

  • Future research on the PROMs identified here, or any future developed PROMs for RA DA, must provide evidence of content validity.

Introduction

The standard measurement instrument for assessing disease activity (DA) for patients with rheumatoid arthritis (RA) has, for many years, been the Disease Activity Score (DAS) with 28-joint count (DAS28),1 and more recently Simple Disease Activity Index (SDAI) and Clinical Disease Activity Index (CDAI).2 DAS28 has four variants3 but all require a laboratory test of either erythrocyte sedimentation rate or C reactive protein (CRP), and a formal tender and swollen joint count assessment (of shoulders, elbows, wrists, hands and knees) undertaken by a healthcare professional. Some of the DAS28 variants also factor in a patient global assessment on a 10 cm visual analogue scale, which adds a level of patient involvement. In common with DAS28, SDAI and CDAI require tender and swollen joint counts and a patient global assessment. In addition, CRP and a physician global assessment are also required for SDAI and CDAI, respectively. Between joint counts and laboratory tests, these assessments are very time-consuming and resource-intensive and can only be undertaken when a patient comes in for a scheduled consultation.

The current standard of care in RA is ‘Treat-to-Target’ (T2T), which aims for sustained remission or, failing this, a low DA score.4 5 Regular assessment of DA and adjustment of treatment accordingly is an integral part of T2T. The National Institute for Health and Care Excellence (NICE)6 and the European Alliance of Associations for Rheumatology (EULAR)5 recommend DA is monitored every 1–3 months when disease is uncontrolled, and every 6–12 months when the treatment target has been reached. Few healthcare providers have the capacity to assess patients as frequently as stipulated by NICE or EULAR guidelines: every 6 months is typically the best that is currently managed.7 The SARS-CoV-2 pandemic has made the problem more conspicuous with remote rather than face-to-face consultations. With infrequent monitoring, treatment is not adjusted sufficiently to keep pace with fluctuation in DA. It can also be the case that those in RA remission are seen more often than necessary while opportunities to treat RA flares are often provided too late.8

Alongside this, and with the advent of patient-centred care and value-based healthcare, Patient Reported Outcome Measures (PROMs) have become the pertinent options for monitoring disease progression and quality of life in numerous fields.9 Unlike paper-based PROMs, electronic PROMs and computer adaptive test (CAT) platforms provide efficient, patient-friendly and location-independent methods of collecting such data, which can also satisfy the necessary properties required to enable useful measurement.9 These CAT platforms are developed under item response theory or Rasch measurement theory methodologies, and allow for patients to respond to a minimal set of items while still calculating an accurate estimate.10 Such examples are seen in the patient-reported outcomes measurement information system initiative.11 The research around RA DA has suggested PROMs might prove preferable to measures requiring biomarkers.12 13 Further, electronic versions of PROMs could be the future of measurement in rheumatology.7 14 15

To best understand the currently available psychometric evidence for these PROMs, a first step is to undertake a systematic review and assess the identified PROMs. Our review builds on the work of a 2016 systematic review in the same area,16 which concluded that three PROMs, Rheumatoid Arthritis Disease Activity Index (RADAI), Rheumatoid Arthritis Disease Activity Index-Five (RADAI5) and Routine Assessment of Patient Index Data 3 (RAPID3), plus another measurement instrument called Patient-derived Disease Activity Score with 28-joint counts (Pt-DAS28, which is DAS28 but with the 28-joint count completed by the patient), had the strongest and most extensive validation. That 2016 review identified articles describing, and assessed the properties of, measurement instruments that are not PROMs. Here though, a tighter lens is applied to focus solely on PROMs in the justification for the inclusion of articles in this review. Furthermore, the accepted guidelines concerning these systematic reviews for assessing PROMs from COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) have been updated since the 2016 systematic review16 and now have a major focus on content validity.17–19 This inevitably influences the assessment of legacy PROMs like those for RA DA, which were developed prior to these guidelines. The recommendations for use, which are the endpoint of the application of the COSMIN guidelines, are based largely on content validity as well.

Given this clear and definitive gap, our objective was to systematically review all PROMs for RA DA against the internationally recognised 2018 COSMIN guidelines17–19 to provide evidence‐based recommendations for use of the most suitable PROMs in research and clinical practice.

Methods

This systematic review of all PROMs for RA DA was registered with the International Prospective Register of Systematic Reviews (PROSPERO)20 as CRD42020176176, where a protocol is available,21 and written in compliance with the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines,22 with the PRISMA checklist provided (online supplemental appendix 1).

Guidelines

COSMIN guidelines were applied throughout this systematic review.17–19 The review of Hendrikx et al16 used the original COSMIN guidelines.23–27 These have since been updated and it is these guidelines17–19 and the methodology within them that were implemented here.

Search strategy

A search strategy was required to identify relevant articles (online supplemental appendix 2). Hendrikx et al published their search strategy,16 which was based on a COSMIN guideline.28 This search strategy was tested and refined to ensure certain articles were identified. An adapted version of that strategy was used to search the PubMED and EMBASE databases up to January 2019. The search was undertaken by one reviewer (TP).

Article selection

Following the implementation of the search strategy, a single assessor (TP) reviewed the articles for relevance, from screening of titles and abstracts through to assessing eligibility for the review. The following inclusion and exclusion criteria were used:

Inclusion criteria:

  1. The study population described in the article is of adult patients with RA.

  2. The article describes details on a PROM specifically for DA in RA that can be reviewed against the COSMIN guidelines.17–19

  3. The article is in the English language.

  4. The article is published in a peer-reviewed journal.

Exclusion criteria:

  1. The study population described in the article includes diseases other than RA (unless the details pertaining to the patients with RA are presented separately).

  2. The study population described in the article includes children (unless the details pertaining to the adult patients are presented separately).

  3. The article describes a measurement instrument that requires healthcare professional assessment.

  4. The article describes a measurement instrument that requires a biomarker level determined through laboratory test.

Data extraction: study population characteristics

Characteristics of the study population, including number of participants, age, gender, rheumatoid factor (per cent positive), disease duration and DA (at baseline if reported at multiple timepoints) were extracted by one assessor (TP). Where multiple study populations were described within a single article, these were pooled together and described as one population if the statistics presented allowed for this. Where study population characteristics were described in a separate article to that reviewed here, that separate article was sought out and characteristics extracted as necessary.

Data extraction: risk of bias, content validity and quality of measurement properties

Data on the relevant measurement properties were extracted and summarised in the ‘COSMIN checklist’ Microsoft Excel (2018) spreadsheet (online supplemental appendix 3) available for download from the COSMIN website.29 The spreadsheet contains the necessary risk of bias questions, which the assessor completes using the categories: very good (V), adequate (A), doubtful (D), inadequate (I) or N/A (N). It also requires the assessor to rate the content validity and quality of measurement properties with ratings: sufficient (+), insufficient (–), inconsistent (±) or indeterminate (?). Decisions were made across the COSMIN domains of PROM development, content validity, structural validity, internal consistency, cross-cultural validity/measurement invariance, reliability, measurement error, criterion validity, hypothesis testing for construct validity and responsiveness. Two assessors (TP, and one of RM, OLA or CB) independently assessed the articles for risk of bias and quality of measurement property. Where ratings differed, a consensus was agreed by the two assessors.

For the purpose of the COSMIN criterion validity and responsiveness domains, a gold standard measurement instrument needed to be identified. DAS28,1 SDAI and CDAI2 are the most widely used and are accepted measurement instruments for this purpose and were therefore used as gold standard measurement instruments for this review.

For hypothesis testing for the COSMIN construct validity and responsiveness domains, subgroups within a rheumatoid arthritis population were assessed.

Where statistics, such as effect sizes, to be reviewed for COSMIN domains were not provided in an article, but could be readily calculated from other values given in the article, such as means and SD, the relevant statistics were calculated.
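The article does not state which effect size formula was used when one had to be derived from reported means and SDs. As a hedged illustration only, the sketch below computes one common choice, a standardised mean difference (Cohen's d with a pooled SD) for two groups. The function name, its parameters and the example numbers are hypothetical and are not taken from any reviewed article.

```python
import math


def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Standardised mean difference between two groups using a pooled SD.

    This is one common effect size; it is an assumption that a statistic of
    this kind is what would be derived from reported means and SDs.
    """
    if n_1 < 2 or n_2 < 2:
        raise ValueError("each group needs at least two participants")
    pooled_variance = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_variance)


# Illustrative numbers only: two groups of 10 with equal SDs of 2.
example_d = cohens_d(5.0, 2.0, 10, 3.0, 2.0, 10)
```

With equal SDs the pooled SD equals the common SD, so the example reduces to the difference in means divided by 2.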

For all correlations, Spearman’s ρ and Kendall’s τ were accepted as appropriate statistical methods. Pearson’s r could also be considered as an appropriate statistical method if some form of distributional information of the PROM was provided, such as mean and SD or a histogram.

In the case where multiple statistics were used to assess for quality of a PROM’s measurement property (that is, one COSMIN domain) within an article, the lowest rating was used.

The COSMIN team were consulted for advice on the following two methodological points to confirm the best course of action.

Where risk of bias items required disease-specific knowledge, the independent assessors consulted with EC who, as a rheumatologist, is a clinical expert in RA to ensure consensus.

Assessment of some measurement properties according to the COSMIN guidelines requires statistical tests that can only be performed with specific software. In the case that the article stated that an analysis had been undertaken but the software stated and/or the outputs reported in the article were not feasible for that analysis, then the assessors followed the rating for risk of bias and quality of measurement properties for the actual analysis, rather than what it was stated as.

Determination of overall rating, quality of evidence and recommendation for use

The quality of the evidence for each summarised measurement property of each PROM was determined by assessors using the modified GRADE30 approach defined in the COSMIN guidelines.17–19 Quality of evidence has categories: high (H), moderate (M), low (L) and very low (VL). These were determined for each COSMIN domain for each PROM on the basis of risk of bias, inconsistency, imprecision and indirectness as specified in the COSMIN guidelines.17–19 The rules for downgrading levels are well set for risk of bias, imprecision and indirectness, but require some formulation for inconsistency. We decided the following:

  • If there was a ≥75% majority for a quality of measurement property rating across the studies, then there was no inconsistency.

  • If there was a majority towards one such quality of measurement property rating but that majority was 60% to <75%, then we noted the inconsistency as serious.

  • If there was no majority (50%) or if there was a majority towards one such quality of measurement property rating but that majority was 50% to <60%, then we noted the inconsistency as very serious.
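The three rules above can be sketched as a small function. This is a minimal illustration, not the authors' tooling: the function name and the ASCII rating symbols ('+', '-', '+/-', '?') are assumptions, and treating a largest share below 50% (for example a three-way split) as very serious is also an assumption, since the rules do not cover that case explicitly.

```python
from collections import Counter


def inconsistency_level(ratings):
    """Classify inconsistency from the per-study ratings of one domain.

    ratings: list of rating symbols, one per study, e.g. ['+', '+', '-'].
    Returns 'none', 'serious' or 'very serious'.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    total = len(ratings)
    # Size of the largest group of identical ratings.
    largest = Counter(ratings).most_common(1)[0][1]
    # Integer comparisons avoid floating point edge cases at 75% and 60%.
    if 100 * largest >= 75 * total:
        return "none"
    if 100 * largest >= 60 * total:
        return "serious"
    # Covers 50% to <60% and, by assumption, anything below 50%.
    return "very serious"
```

For example, three sufficient ratings out of four studies give a 75% share and so no inconsistency, whereas a one-to-one split gives 50% and is treated as very serious.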

The overall rating was the majority content validity or quality of measurement property rating for each domain for each PROM, so we similarly used categories sufficient (+), insufficient (–), inconsistent (±) or indeterminate (?). In the case of no majority, the lower quality of measurement property rating was used.
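A minimal sketch of this majority rule follows. The text does not give a full ordering of the four ratings, so the ordering below (lowest to highest) is a hypothetical assumption, as are the function name and the ASCII symbols; the only ordering this sketch relies on from the review is that sufficient (+) ranks above both insufficient and indeterminate.

```python
from collections import Counter

# Hypothetical ordering from lowest to highest rating (an assumption).
ASSUMED_ORDER = ["-", "?", "+/-", "+"]


def overall_rating(ratings, order=ASSUMED_ORDER):
    """Return the majority rating, or the lowest rating present if none.

    A majority is taken to mean strictly more than half of the ratings.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    rating, count = Counter(ratings).most_common(1)[0]
    if 2 * count > len(ratings):
        return rating
    # No majority: fall back to the lowest rating under the assumed ordering.
    return min(ratings, key=order.index)
```

Under these assumptions, a split between insufficient and sufficient resolves to insufficient, and a split between indeterminate and sufficient resolves to indeterminate.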

Once all summarised ratings for each COSMIN domain of each PROM were decided, recommendations for use of the reviewed PROMs for RA DA were applied according to the categories stipulated in the COSMIN guidelines.17–19 These are:

  A. PROM has evidence for sufficient content validity (any level) and at least low-quality evidence for sufficient internal consistency. Therefore, the PROM can be recommended for use and results obtained with these PROMs can be trusted.

  B. PROM cannot be categorised into A or C. Therefore, the PROM has potential to be recommended for use, but requires further research to assess its quality.

  C. PROM has high-quality evidence for an insufficient measurement property. Therefore, the PROM should not be recommended for use.
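The category logic can be sketched as below, referring to the first, second and third categories as A, B and C as the text does. This is an illustration under stated assumptions: the dictionary layout, the domain key names and the function name are hypothetical, and checking the first category before the third is an assumption because the text does not say which takes precedence if both conditions hold.

```python
# Quality of evidence levels, ranked so they can be compared.
QUALITY_RANK = {"very low": 0, "low": 1, "moderate": 2, "high": 3}


def recommendation_category(evidence):
    """Assign a recommendation category to one PROM.

    evidence: dict mapping a domain name to (overall rating, quality), e.g.
    {"internal consistency": ("+", "low")}. Key names are hypothetical.
    """
    cv_rating, _ = evidence.get("content validity", ("?", "very low"))
    ic_rating, ic_quality = evidence.get("internal consistency", ("?", "very low"))
    # A: sufficient content validity (any level of evidence) and at least
    # low-quality evidence for sufficient internal consistency.
    if cv_rating == "+" and ic_rating == "+" and QUALITY_RANK[ic_quality] >= QUALITY_RANK["low"]:
        return "A"
    # C: high-quality evidence for any insufficient measurement property.
    if any(rating == "-" and quality == "high" for rating, quality in evidence.values()):
        return "C"
    # B: everything that fits neither A nor C.
    return "B"
```

A PROM whose worst result is moderate-quality evidence for an insufficient property, and which lacks sufficient content validity, falls through both checks and lands in the middle category.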

Assessor agreement

Percentage agreement was calculated for all content validity and related risk of bias, plus all quality of measurement properties and related risk of bias combined. While there are three individuals acting as independent assessors, these are all combined here.

Results

Article identification

A PRISMA flow diagram22 31 presents the results of the search strategy and the subsequent review process undertaken to reach the 31 articles in this review (figure 1).

Figure 1

PRISMA flow diagram of article selection.

Of the 34 articles from the Hendrikx et al review16 (left side of figure 1), 13 articles32–44 were excluded. The 21 remaining articles45–65 from the Hendrikx et al review16 were included in this review.

After the deletion of 8 duplicates (identified twice by the same source), 668 articles were identified (right side of figure 1). Ten66–75 of these were included in the review, so a total of 31 articles were included. These 31 articles described ten PROMs: RADAI, Rheumatoid Arthritis Disease Activity Index-Short Form (RADAI-SF), RADAI5, RAPID3, Routine Assessment of Patient Index Data 4 (RAPID4), Patient-based Disease Activity Score 2 (PDAS2), Patient Activity Score (PAS), Patient Activity Score-II (PAS-II), Patient Reported Outcome CLinical ARthritis Activity (PRO-CLARA) and Global Arthritis Score (GAS). Of the 10 articles published since the Hendrikx et al review,16 8 described RAPID3 and RAPID4 under the COSMIN criterion validity domain, 1 described PRO-CLARA under the COSMIN hypotheses testing for construct validity domain and another one described RADAI under the COSMIN responsiveness comparison between subgroups subdomain.

Study characteristics

The characteristics of the study populations of the 31 articles are given in table 1. The majority of these articles (n=19) described a single PROM, nine described two PROMs, two described three PROMs and one article described four PROMs.

Table 1

Characteristics of the study populations of the 31 articles

Only 1 clinical trial68 was included in this review, of the 51 clinical trial articles sought for retrieval. It was notable that many excluded trials describe PROMs but not for RA DA, and where PROMs for RA DA were described, the statistical detail provided is only available to be assessed against the COSMIN guidelines in this single article.

Additional methodological requirements defined after articles were identified

The majority of PROMs described in the articles found in this review have a scoring system involving precalculation of a variable before summing up to create a total score, rather than just summing up the items they include. This was problematic for the assessment of the COSMIN structural validity and internal consistency domains, as, to be undertaken correctly and to provide relevant meaningful results, these require the individual items to be inputted into the analysis, rather than a combination of precalculated variable and individual items. Therefore, in the case where results that would be assessed under the COSMIN structural validity and internal consistency domains were given but the PROM had such a scoring structure, the result was ignored and a note added to the relevant cell or cells of the ‘COSMIN checklist’ Microsoft Excel (2018) spreadsheet stating this. All other COSMIN domains were still assessed, as those numeric COSMIN domains focus on analyses requiring the total score, rather than that of the individual items. A list of PROMs, the reasons why there was a problem and whether any structural validity or internal consistency analyses were undertaken are provided (online supplemental appendix 4).

For the assessment of the COSMIN measurement error domain, the minimally important change (MIC) must be defined for comparison against smallest detectable change (SDC) or limits of agreement (LoA). SDCs were calculated for RADAI and RAPID3 in one reviewed article63 and these were compared against values (labelled as minimally important difference (MID)) from an article not reviewed here.76 It was notable that in the article providing the MIC, which was not reviewed here,76 RAPID3 was correctly scored 0–30, but in the reviewed article,63 RAPID3 was scored 0–10, so for comparison, the MIC in Pope was divided by 3; 3.6 became 1.2.
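The rescaling of the MIC from the 0–30 to the 0–10 RAPID3 scoring is a simple linear conversion. The sketch below reproduces that arithmetic; the function name is hypothetical, and it assumes both scorings start at 0 and differ only by a constant factor.

```python
def rescale_threshold(value, from_max, to_max):
    """Linearly rescale a threshold between two scorings that both start at 0."""
    if from_max <= 0 or to_max <= 0:
        raise ValueError("scale maxima must be positive")
    return value * to_max / from_max


# MIC of 3.6 on the 0-30 scoring, expressed on the 0-10 scoring.
mic_on_0_to_10 = rescale_threshold(3.6, from_max=30, to_max=10)
print(round(mic_on_0_to_10, 1))  # 1.2
```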

Within the COSMIN hypothesis testing for construct validity domain, and specifically the COSMIN comparison with other outcome measurement instruments (convergent validity) subdomain, the second risk of bias item asked: ‘Were the measurement properties of the comparator instrument(s) adequate?’ and to answer this, there was a need to determine whether sufficient measurement properties of the comparator instrument were available and which study population they applied to. In all cases but one, there were ‘sufficient measurement properties of the comparator instrument(s) in a population similar to the study population’ except for pulp-to-palm distance, which only had measurement properties described in an orthopaedics population.77 For this reason, the risk of bias rating for RAPID3 in one of the reviewed articles72 was doubtful (D), regardless of the other comparator instruments.

Relevant specifically to quality of measurement properties ratings of the COSMIN hypothesis testing for construct validity and responsiveness domains, COSMIN recommend that the review team formulate a set of hypotheses.78 Therefore, hypotheses (online supplemental appendix 5) were agreed in consultation between all assessors and with clinical expertise from EC.

Content validity and related risk of bias

Only two articles described content validity, both under the heading of Cognitive interview study or other pilot test for the COSMIN comprehensibility domain (table 2, Content validity—PROM Development columns). These covered the PROMs PDAS2 (47) and PRO-CLARA.59

Table 2

Risk of bias rating, content validity rating and quality of measurement property rating of the 31 articles

For PDAS2, very little detail about the process was available other than the number of participants interviewed, which was 20, so a D risk of bias rating was given under cognitive interview study or other pilot test comprehensibility study. For the content validity comprehensibility, patients were not asked about the comprehensibility of item instructions and it was not clear if patients were asked about the comprehensibility of all of the items and their response options (some items were mentioned but others were not), so the overall content validity rating was –. The rating of reviewers was + because it was clear that all items and response options were appropriately worded and response options matched the items.

For PRO-CLARA, 72 patients were surveyed but a qualitative method should be used to assess content validity, so a D risk of bias rating was given under Cognitive interview study or other pilot test Comprehensibility study. There was no detail about patients being asked about the comprehensibility of item instructions and of the items and their response options, so the overall content validity rating was ?. The rating of reviewers was + because it was clear that all items and response options were appropriately worded and response options matched the items.

Content validity, downgrading, overall rating and quality of evidence

There were no Relevance or Comprehensiveness ratings, so there were no overall content validity ratings; therefore, the quality of evidence rating for both of these was Very low. As there was no majority for content validity overall rating, the lowest was used and was thus insufficient (–) for PDAS2 and indeterminate (?) for PRO-CLARA (table 3, Content validity Comprehensibility row).

Table 3

Overall rating and quality of the evidence for measurement properties of the PROMs, plus recommendation for use category

Quality of measurement properties and related risk of bias

Table 2 columns to the right of content validity summarise the quality of measurement properties and related risk of bias ratings.

Quality of measurement properties, downgrading, overall rating and quality of evidence

The evidence in table 2 allowed the overall rating and quality of evidence to be determined, as presented in table 3.

Recommendations for use

Using the overall rating and quality of evidence for each COSMIN domain within each PROM, recommendations for use in research and clinical practice, the main result of this systematic review, were attributed (table 3, final row) as follows:

  • Category B: RADAI-SF, PDAS2, PAS, PRO-CLARA and RAPID3.

  • Category C: RADAI, RADAI5, PAS-II, RAPID4 and GAS.

There were no PROMs attributed to Category A, as none had sufficient evidence of content validity. All Category C PROMs had at least one COSMIN domain with High quality evidence for an insufficient (-) measurement property and all Category B PROMs had at least one COSMIN domain with High quality evidence for a sufficient (+) measurement property, except RAPID3, which, at best, had Moderate quality evidence for an insufficient (-) measurement property. Despite this, it fitted into neither Category A nor C and was therefore attributed to Category B.

Assessor agreement

From a total of 435 ratings, 399 were in agreement, giving an overall agreement of 91.7%.
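The percentage agreement is the number of matching ratings divided by the total number of ratings. A minimal check of the reported figure follows; the function name is hypothetical.

```python
def percentage_agreement(n_agree, n_total):
    """Percentage of ratings on which the two assessors agreed."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_agree <= n_total:
        raise ValueError("n_agree must lie between 0 and n_total")
    return 100 * n_agree / n_total


print(f"{percentage_agreement(399, 435):.1f}%")  # 91.7%
```

Percentage agreement does not correct for agreement expected by chance, which is worth bearing in mind when one rating category dominates.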

Discussion

The lack of sufficient evidence for content validity means that no PROMs identified in this review can be recommended for use (attributed to Category A) in research and clinical practice. PROMs RADAI-SF, PDAS2, PAS, PRO-CLARA and RAPID3 are attributed to Category B and therefore have potential to be recommended for use, but require further research to assess their quality. PROMs RADAI, RADAI5, PAS-II, RAPID4 and GAS are attributed to Category C and therefore should not be recommended for use.

RAPID3 is attributed to Category B despite, at best, having Moderate quality evidence for an insufficient (-) measurement property for the COSMIN responsiveness domain. This is a lower level of evidence than all PROMs in Category C, which have at least one COSMIN domain with High quality evidence for an insufficient (-) measurement property. It would appear as a limitation of the COSMIN guidelines that there is no Category D for PROMs like RAPID3.

While it is not possible to excuse the research community for not having undertaken the necessary research, it is notable that all identified PROMs were first described before or in the same year as the first set of COSMIN guidelines in 2010,23–27 and therefore all before the updated COSMIN guidelines17–19 in which content validity was prioritised. This research must be done before any of these PROMs can be recommended, and the same is true of any new PROMs that are developed for the measurement of RA DA.

It is also important to note that many of the PROMs identified here have a scoring system involving precalculation of a variable before summing up to create a total score, rather than just summing up the items they include, and that this causes an issue with the assessment of the COSMIN structural validity domain and the COSMIN internal consistency domain. That this is the case for 8 of the 10 PROMs (RADAI, PDAS2, PAS, PAS-II, RAPID3, RAPID4, PRO-CLARA and GAS) identified here suggests that there is a systematic reason behind this for PROMs for RA DA. Five of these precalculate a joint count variation and six precalculate a functional ability variation. Joint count variations are used in all three gold standard measurement instruments defined here (DAS28, CDAI and SDAI),1–3 while joint count and functional ability variations are key within the American College of Rheumatology (ACR) criteria.79 A desire to continue the use of known instruments with PROMs may have contributed to this issue and can be moved away from, as is seen in RADAI-SF and RADAI5.

Assessor agreement was considerably lower for content validity related risk of bias and content validity ratings. This is due to the paucity of information available in the two articles,47 59 and difficulty in interpreting what the authors actually undertook. Assessor agreement was much higher for quality of measurement property related risk of bias and quality of measurement property, and the overall agreement is also high. This is largely dominated by 231 quality of measurement property related risk of bias agreements on a Very good rating.

The ACR Rheumatoid Arthritis Disease Activity Measure Workgroup have published a systematic review of all RA DA measurement instruments80 and recommend the following two PROMs identified here as preferred measures for regular use: RAPID3 and PAS-II. DAS28, CDAI and SDAI are also recommended as preferred measures for regular use. Additionally, in this ACR review, two PROMs identified here reached the minimum standard for regular use: RADAI and RADAI5, and this was also the case for DAS, Patient Derived DAS28, Hospital Universitario La Princesa Index, Multi-Biomarker Disease Activity Score and Routine Assessment of Patient Index Data 5 (RAPID5).

The previous systematic review undertaken by Hendrikx et al in 201616 stated that, of the PROMs identified here, RADAI, RADAI5 and RAPID3 had the most extensive validation and the strongest level of evidence. It also stated the same of a measurement instrument labelled as Pt-DAS28, which is not a PROM.

There are therefore recommendations for PROMs RADAI, RADAI5 and RAPID3 from both sources16 80 and PROM PAS-II from the ACR review,80 while here we cannot recommend any identified PROMs.

The ACR review,80 an update from 2012,81 includes all possible measurement instruments for assessing RA DA. It is therefore difficult to fully implement the COSMIN guidelines17–19 for this review as these relate solely to PROMs. Furthermore, their methods included a Delphi survey to aid in determining whether a measurement instrument should be recommended.

In the Hendrikx et al review,16 a previous set of COSMIN guidelines was employed23–27 that did not prioritise content validity. Also, measurement instruments reviewed included biomarkers and/or healthcare professional assessments, included articles contained information on the evolution of PROMs not yet in their finalised state, and other included articles described measurement instruments as PROMs when they fulfilled a different role. This review applies a tighter lens focusing solely on PROMs and makes use of the most recent COSMIN guidelines,17–19 which provides some reasoning behind the discrepancies noted above.

The set of assessors were not experts in RA and therefore did not have the knowledge to complete with certainty some risk of bias items. As mentioned, where this was the case, TP discussed the matter with EC, and then with the independent assessors. This is a limitation of the independence of the assessors on these few risk of bias items, as there was essentially only one opinion. Further limitations are that only one assessor (TP) undertook the search strategy, article selection and data extraction of study population characteristics.

We state the necessary hypotheses required for this review in the Additional methodological requirements defined after articles were identified. These were written in consultation with the review team and EC. For comparison, we searched PROSPERO for the term ‘COSMIN’ and limited the results to those registered with musculoskeletal as the health area of review. A total of 184 records were returned, but we found only one published article that defines hypotheses.82 The hypotheses in this article relate solely to correlations. As we have stated, this article also sets 0.5 as a lower bound for convergent correlations, but then uses 0.3–0.5 for semiconvergent correlations, where we use 0.4 as a lower bound, and also sets 0.3 as an upper bound for divergent correlations, where we use 0.3–0.5. There is of course no set guideline on where to place these bounds, and we see here that only correlations are set, where we also defined hypotheses for effect sizes and areas under the curve. The article found through PROSPERO82 makes reference to some of the updated 2018 COSMIN articles17 18 and also the online user manual, which does provide very similarly written generic hypotheses, which are reproduced from Measurement in Medicine.78 There is a case for more detailed research in this area to provide guidelines on how review teams, or indeed researchers undertaking the original research, could define these hypotheses.
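The difference between the two sets of correlation bounds described above can be illustrated with a minimal Python sketch. This is not code used in the review: the names `Bounds`, `THIS_REVIEW`, `COMPARISON_ARTICLE` and `meets_hypothesis` are hypothetical, and the sketch assumes that correlations are expressed as magnitudes, that the interval endpoints are inclusive, and that the 0.3–0.5 ranges are to be read as lower and upper bounds.

```python
from typing import NamedTuple, Optional


class Bounds(NamedTuple):
    """Lower and upper bound for a hypothesised correlation (None = unbounded)."""
    lower: Optional[float]
    upper: Optional[float]


# Bounds as described in the text for this review (encoding is an assumption).
THIS_REVIEW = {
    "convergent": Bounds(0.5, None),
    "semiconvergent": Bounds(0.4, None),
    "divergent": Bounds(0.3, 0.5),
}

# Bounds as described in the text for the article found through PROSPERO.
COMPARISON_ARTICLE = {
    "convergent": Bounds(0.5, None),
    "semiconvergent": Bounds(0.3, 0.5),
    "divergent": Bounds(None, 0.3),
}


def meets_hypothesis(r: float, kind: str, bounds: dict) -> bool:
    """Return True if correlation r falls within the bounds for this hypothesis type.

    Assumes r is a correlation magnitude and that endpoints are inclusive.
    """
    b = bounds[kind]
    if b.lower is not None and r < b.lower:
        return False
    if b.upper is not None and r > b.upper:
        return False
    return True


if __name__ == "__main__":
    # The same observed correlation can be judged differently under each set.
    r = 0.35
    for label, bounds in (("this review", THIS_REVIEW),
                          ("comparison article", COMPARISON_ARTICLE)):
        print(label, meets_hypothesis(r, "semiconvergent", bounds))
```

Under these assumptions, a semiconvergent correlation of 0.35 would meet the hypothesis in the comparison article but not in this review, which shows how the placement of the bounds can change a rating.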

None of the identified articles made use of item response theory or Rasch measurement theory to evidence the psychometric properties of these PROMs. These are defined for use in the COSMIN structural validity domain and can also provide evidence for the COSMIN cross-cultural validity/measurement invariance domain, for which there was no evidence. All evidence in the COSMIN structural validity domain was provided through confirmatory or exploratory factor analyses.52 53 64

In conclusion, no PROMs identified in this review can be recommended for use according to COSMIN guidelines17–19 due to lack of sufficient evidence for content validity. This is despite previous reviews16 80 suggesting the use of RADAI, RADAI5, PAS-II and RAPID3. All PROMs identified here were first described before initial COSMIN guidelines were published and thus also before the updated guidelines that prioritised content validity. The majority of identified PROMs have scoring systems that preclude evidence in the COSMIN structural validity and internal consistency domains. Care should be taken when making use, or interpreting the results, of any of the PROMs for RA DA identified in this review. Future research on the PROMs identified here, or any future developed PROMs for RA DA, must look to evidence content validity. Future developed PROMs should implement scoring systems without precalculation of variations entered into the scoring system with other items. These could also look to item response theory or Rasch measurement theory to evidence their psychometric properties. This would allow for any such PROMs to be developed into computer adaptive tests.

Data availability statement

Data sharing not applicable as no datasets generated and/or analysed for this study.

Footnotes

  • Twitter @picklestim, @lee_1881, @beecherclaire1, @Nannon_Phillips, @DaveGuk87, @ErnestChoy1

  • Contributors TP and EC drafted the initial research idea. TP undertook the search strategy, reviewed all titles and then abstracts for inclusion, reviewed all full-text articles, extracted results and assessed them against COSMIN guideline criteria. RM, OLA and CB extracted results and assessed them against COSMIN guideline criteria for the articles assigned by TP. All authors declare having read and made a substantial contribution to the final manuscript. TP is the guarantor.

  • Funding TP is supported by a National Institute for Health Research (NIHR) Doctoral Fellowship, funded by the Welsh Government through Health and Care Research Wales (NIHR-FS-19). The systematic review was funded via a Health and Care Research Wales Pathway to Portfolio Grant. This work was supported by the MRC-NIHR Trials Methodology Research Partnership (MR/S014357/1). OLA receives funding from the NIHR Birmingham Biomedical Research Centre (BRC), NIHR Applied Research Collaboration (ARC), West Midlands at the University of Birmingham and University Hospitals Birmingham NHS Foundation, Innovate UK (part of UK Research and Innovation), Gilead Sciences Ltd and Janssen Pharmaceuticals, Inc. OLA declares personal fees from Gilead Sciences Ltd, GlaxoSmithKline (GSK) and Merck outside the submitted work. RM is funded by the NIHR Bristol Biomedical Research Centre (BRC).

  • Disclaimer The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.