From inhibition of radiographic progression to maintaining structural integrity: a methodological framework for radiographic progression in rheumatoid arthritis and psoriatic arthritis clinical trials
Robert Landewé1, Vibeke Strand2, Désirée van der Heijde3

  1. Department of Clinical Immunology & Rheumatology, Academic Medical Center/University of Amsterdam & Atrium Medical Center, Heerlen, The Netherlands
  2. Division of Immunology and Rheumatology, Stanford Hospital, Portola Valley, California, USA
  3. Department of Rheumatology, Leiden University Medical Center, Leiden, The Netherlands

Correspondence to Professor Robert Landewé, Department of Clinical Immunology & Rheumatology, Academic Medical Center/University of Amsterdam & Atrium Medical Center, Heerlen, The Netherlands; landewe{at}rlandewe.nl

Abstract

Usually, a clinical trial in rheumatoid arthritis and psoriatic arthritis aiming to demonstrate that a new antirheumatic drug treatment can inhibit progression of structural damage has a ‘superiority design’: the new treatment is compared with placebo or with another active treatment. Currently, many new drug treatments have been shown to be able to completely suppress progression (progression rates close to zero). For largely unknown reasons, radiographic progression rates in clinical trials have gradually decreased during the last 10 years, so that progression rates in the comparator groups are often too low to demonstrate meaningful inhibition, and thus superiority of the new treatment. We here propose an alternative framework to demonstrate that new treatments have the ability to ‘preserve structural integrity’ rather than to ‘inhibit radiographic progression’. As of 2013, preserving structural integrity is conceptually more realistic than inhibiting radiographic progression.

  • DMARDs (biologic)
  • Treatment
  • Rheumatoid Arthritis
  • Psoriatic Arthritis

Introduction

Radiographic progression has been an important outcome measure in rheumatoid arthritis (RA) and psoriatic arthritis (PsA) randomised controlled trials (RCTs). Synthetic and biological disease modifying antirheumatic drugs (DMARDs) have been labelled for ‘inhibition of progression of structural damage’ in RA and/or PsA on the basis of RCTs in which they demonstrated that progression of radiographic damage was statistically significantly less than progression in the control arm.

This scenario requires a classic ‘superiority trial design’ in which the efficacy of a new treatment is tested under the null hypothesis of ‘no superiority’ against a comparator treatment (or placebo) in the control arm of the trial. Thus far the majority of these superiority designs have succeeded, because the mean radiographic progression rate in the control arms was sufficient for statistical superiority of the new treatment over the control to be demonstrated, beyond measurement error and random variability.

Over the past 5 years, three partially related developments have increasingly jeopardised the usefulness of superiority designs in RCTs for demonstrating ‘inhibition of structural progression’. They are discussed below.

Definition of the problem

  1. There is now a general tendency towards less radiographic progression in more recent RCTs.1 The exact cause of these observed decreases is largely unknown, and an in-depth discussion is beyond the scope of this article. One explanation may be that patients with less severe RA are included in modern RCTs, another that patients receive better treatment earlier in their disease course. The consequence of this development is that the signal of progression in the control arm of the trial becomes too low in relation to the (unchanged) level of noise (‘signal to noise ratio’), so that the beneficial effect of the new therapy can no longer be statistically supported, even if progression is completely absent in the active treatment group. Opponents of measuring radiographic progression as an outcome in RCTs use this low level of progression as an argument that differences are too small to be clinically meaningful. However, the real problem is one of statistical power rather than clinical relevance.

  2. As a consequence of the advances in therapeutic development in RA and PsA, many effective DMARDs are now available for treatment of patients. From an ethical point of view it can be argued that patients in the control arms of modern RCTs cannot remain untreated (eg, receiving placebo) for even short periods of time (eg, 3 months) as the goal of ‘standards of care’ has improved from ‘symptomatic relief’ to ‘attaining low disease activity’ to ‘clinical remission’.

    Although this may be a ‘luxury’, underscoring progress made in rheumatology, this challenge brings into question RCTs with classic superiority designs, because an improved standard of care implies a lower level of radiographic progression, which compounds the problems listed in #1.

  3. An ad hoc solution to the ethical dilemma of placebo treatment (and its duration) outlined under #2 is to limit the length of placebo treatment in an RCT to 3 months at most. At 3 months, patients without a predefined clinical response are allowed to switch to the active-treatment group (‘early rescue’). While ethically appropriate, early rescue limits the amount of radiographic progression that can be observed in the control arm, assuming that more patients are rescued in the control arm(s) than in the active-treatment arm.

    A methodological approach to account for early rescue when measuring radiographic progression is an imputation process called ‘linear extrapolation’. Linear extrapolation assumes that changes between baseline and 3 months of placebo (control) treatment can be extrapolated to a total period of 6–12 months, thereby estimating the progression that would have occurred if placebo had been continued for the entire time period. Theoretical objections against linear extrapolation are numerous, and this discussion is also beyond the scope of this article. Nonetheless, it is worth noting that linear extrapolation assumes a strictly linear course of progression within individual patients, which is artificial and may tend to overestimate true progression.

    Although linear extrapolation is not ideal, it accounts appropriately for the otherwise unsolvable methodological problem of non-random attrition, meaning that patients with a higher likelihood of radiographic progression (those with highly active disease) are more likely to receive rescue, or leave the trial for lack of efficacy. As such, linear extrapolation has gained a place in RCTs with a classic superiority design.
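The imputation described under #3 can be sketched in a few lines. This is only a minimal illustration under the stated assumption of a constant rate of change per patient; the function name, parameter names and example scores are hypothetical and not taken from any trial.

```python
def linear_extrapolation(baseline_score, score_at_rescue, months_at_rescue, target_months):
    """Project a patient's change in total score to target_months.

    Assumes (as linear extrapolation does) that the rate of change observed
    up to the moment of rescue would have continued unchanged.
    """
    if months_at_rescue <= 0:
        raise ValueError("months_at_rescue must be positive")
    observed_change = score_at_rescue - baseline_score
    # Scale the observed change by the ratio of target to observed duration.
    return observed_change * target_months / months_at_rescue


# Hypothetical patient: total score 20.0 at baseline, 20.5 when rescued at month 3.
imputed_12_month_change = linear_extrapolation(20.0, 20.5, months_at_rescue=3, target_months=12)
print(imputed_12_month_change)
```

Note that any measurement error in the 3-month change is multiplied by the same factor as the true change, which is one reason the imputed values can overestimate progression.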

These three related methodological issues emphasise the increasing difficulties associated with demonstrating superiority for inhibition of structural damage by a new treatment against a comparator or control treatment. Having been involved in the analyses and interpretation of many recent RCTs, the authors believe that a classical superiority design has become inappropriate for demonstrating lack of radiographic progression.

  • Still, this principle of disease modification, which has significantly furthered clinical development, requires that modern treatments for RA and PsA, diseases with an inherent propensity to cause irreversible and debilitating changes in joint structure, are capable of at least maintaining structural integrity and/or preventing further deterioration of already existing damage. DMARDs differ from symptom-modifying treatments in that they have proven efficacy in this regard, and attainment of this attribution should not be lost in the future.

Measurement of radiographic progression in RCTs

There exists consensus about how radiographic progression should be measured in RCTs in RA or PsA.2 Scoring is performed using methodology based on the Sharp scoring system. The original method and two modifications (van der Heijde-Sharp and Genant-Sharp) have been used in all RCTs which have resulted in regulatory labelling for ‘structural inhibition’. These methods assign scores to hands and feet for presence and severity of erosions and joint space narrowing (JSN) separately.

Two (or more) trained readers evaluate x-rays obtained at two or more time points in such a manner that they can compare changes in erosions and JSN joint by joint. By-patient change scores from both readers are averaged, and a total score, summing erosions and JSN, is calculated for each patient.
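As a rough sketch of the averaging step described above: the per-reader change scores for erosions and JSN are averaged across readers and then summed into a total change score for the patient. The data layout (one tuple per reader) and the function name are assumptions made for illustration only.

```python
from statistics import mean


def patient_total_change(reader_changes):
    """Return one patient's total change score.

    reader_changes: list of (erosion_change, jsn_change) tuples, one per reader.
    Each component is averaged over readers, then the two averages are summed.
    """
    erosion_change = mean(e for e, _ in reader_changes)
    jsn_change = mean(j for _, j in reader_changes)
    return erosion_change + jsn_change


# Hypothetical patient scored by two readers.
print(patient_total_change([(1, 0), (2, 1)]))
```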

For a further understanding of this article it is relevant to emphasise that readers score x-rays without knowledge of treatment allocation or of the time sequence of the radiographs. They can therefore be considered truly ‘double-blind’.

Measurement error

Measurement of radiographic damage and progression, as with any measurement, is sensitive to measurement error. Technical imaging differences, differences in joint positioning, differences in lighting and differences in image processing for reading are examples of sources of error beyond the influence of the readers. ‘True’ intrareader and inter-reader variability are sources of inherent reader variability and compound the total level of error.

The set-up of a reading session in which the readers score the radiographs pairwise without knowledge of treatment assignment or time order allows estimation of the level of measurement error under a set of assumptions. The most important assumption is that, as the time order is blinded, truly negative scores (‘repair’) are impossible. This assumption has been violated in practice, as demonstrated by us,3 but the contribution of ‘true repair’ in relation to ‘measurement error’ can be estimated as relatively minor.

The effects of measurement error can best be visualised in a probability plot.4 The negative change scores are depicted on the left side of the plot, and the positive changes on the right side. The mean group change score corresponds mathematically to the net (signed) area enclosed by the plotted observations and the x-axis. If this net area is zero, the mean group progression is zero. If it is greater than zero, the area enclosed on the right side of the plot is greater than the area enclosed on the left side, and there is radiographic progression at the group level.
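The relation between areas and the group mean can be checked numerically. The sketch below assumes a cumulative probability plot in which the change scores are sorted along an x-axis running from 0 to 1, so that each patient occupies an equal width of 1/n; the scores themselves are hypothetical.

```python
def enclosed_areas(change_scores):
    """Return (negative_area, positive_area) enclosed with the x-axis.

    Each observation is treated as a bar of width 1/n on a cumulative
    probability axis; both areas are returned as non-negative numbers.
    """
    width = 1.0 / len(change_scores)
    negative_area = sum(-score * width for score in change_scores if score < 0)
    positive_area = sum(score * width for score in change_scores if score > 0)
    return negative_area, positive_area


scores = [-2.0, -0.5, 0.0, 0.0, 0.5, 1.0, 3.0]  # hypothetical change scores
negative_area, positive_area = enclosed_areas(scores)
mean_change = sum(scores) / len(scores)

# The positive area minus the negative area reproduces the group mean.
print(positive_area - negative_area, mean_change)
```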

This visualisation perfectly illustrates that even in a situation in which the mean progression score is zero, (highly) positive and (highly) negative individual scores can be found in every dataset, largely reflecting measurement error. It is impossible to determine if these extreme individual scores are due only to measurement error, true signal or a combination of both. The increasingly vocalised argument that small treatment differences can be caused by a few positive extreme observations only makes sense if positive and negative outliers are compared in a balanced manner. One way of investigating the sensitivity to the effects of outliers of a radiographic dataset is to remove increasing percentages of observations at both the extreme sides of the probability curve (‘trimming’), to find out what influence this has on mean effects and variability. In fact, this method artificially reduces the level of measurement error and therefore will have an impact on the SD, and the 95% CI surrounding the mean effect size. Ideally, the mean effect size should be approximately the same with increasing proportions of trimming, which makes it unlikely that outliers have had an important influence.
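A trimming sensitivity analysis of this kind could look roughly as follows. The sketch removes the same fraction of observations from each tail and reports the mean, SD and remaining sample size; the trimming fractions and change scores are hypothetical, and the function name is an assumption for illustration.

```python
from statistics import mean, stdev


def trimmed_summary(change_scores, fraction):
    """Drop `fraction` of the observations from EACH tail, then summarise.

    Returns (mean, SD, number of observations kept).
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(change_scores)
    k = int(len(ordered) * fraction)  # observations removed per tail
    kept = ordered[k:len(ordered) - k]
    return mean(kept), stdev(kept), len(kept)


# Hypothetical change scores with one extreme value in each tail.
scores = [-6.0, -1.0, -0.5, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
          0.0, 0.5, 0.5, 0.5, 1.0, 1.0, 1.5, 2.0, 2.5, 8.0]
for fraction in (0.0, 0.05, 0.10, 0.25):
    m, sd, n = trimmed_summary(scores, fraction)
    print(f"trim {fraction:.0%} per tail: n={n}, mean={m:.2f}, SD={sd:.2f}")
```

If the mean stays roughly stable while the SD shrinks as more observations are trimmed, the result is unlikely to be driven by a few outliers.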

Potential solutions to demonstrate inhibition of radiographic progression

Replacing conventional radiography by MRI

The development of MRI scoring in RA and PsA has been significant.5 Scoring methods for MRI have been developed for RA (Rheumatoid Arthritis Magnetic Resonance Imaging Score; RAMRIS)6 and PsA (Psoriatic Arthritis Magnetic Resonance Imaging Score; PSAMRIS),7 both of which have undergone extensive validation and demonstrate relatively good ‘psychometric’ properties. MRI can detect changes in synovitis and osteitis (or: bone marrow oedema), erosions and changes in their number and appearance, and detect and quantify tenosynovitis. MRI is more sensitive than conventional radiography in showing (pre)erosions, and osteitis has shown predictive validity for erosions. MRI studies have been performed as small ‘companion studies’ in large RCTs, proving its ability to discriminate between active and comparator treatment (or placebo) with regard to synovitis, osteitis and the occurrence of erosions.

Important limitations in feasibility thus far limit the broad application of MRI for measuring structural integrity in RA and PsA. MRI may appropriately quantify erosions but the measurement of JSN, an integral part of structural integrity, is less well developed. MRI scoring is (typically) limited to the wrist and metacarpophalangeal (MCP) joints of the dominant hand, thereby excluding the other hand and both feet, which imposes methodological limitations to required variability. Limitations regarding availability of MRI, level of technical experience required, and reluctance of patients to undergo multiple (time-consuming) MRIs may further jeopardise already difficult accrual of eligible patients in RCTs in many countries. Most importantly, truly long-term observations with MRI demonstrating preservation of structural integrity are lacking. At the present time, therefore, the preferred use of MRI may be in short-term (Phase 2) proof-of-concept trials, where the combination of information about synovitis and erosions, together with its higher sensitivity, may facilitate appropriate decisions about the likelihood of treatment efficacy.

Optimising an enriched trial population

In RCTs with a superiority design it is common to preferentially include patients with a relatively high likelihood of radiographic progression. Inclusion criteria include statements about a minimum required level of disease activity, presence of rheumatoid factor or anticitrullinated antibodies, and presence of erosions at baseline. The use of these inclusion criteria is justified because they have predictive capacity for radiographic progression at the group level. From a methodological perspective, enrichment is advantageous, and one could decide to further enrich clinical trial populations to assure more radiographic progression in the control arms of RCTs. The problem is that over the last decade progression rates have decreased despite efforts at enrichment. Further, enrichment comes at the cost of external validity, in that the trial population, already not reflective of the patient population in clinical practice, will become even less representative. Lastly, prediction of progression is imperfect, and is only effective to some extent at the group level, and not at all on an individual patient basis. And at the present time, it is highly unlikely that simple and feasible measures (such as soluble biomarkers) will importantly improve predictability.

The non-inferiority trial

The null hypothesis underlying a non-inferiority trial, frequently advocated in situations in which an effective treatment is already available, is that the new treatment is inferior to the existing treatment. Only if the effect of the new treatment surpasses the predefined non-inferiority margin (NIM), which in turn is based on a clinically acceptable difference with the existing treatment (eg, 10% less), will the new treatment be declared ‘non-inferior’. Usually, a new treatment has advantages other than clinical ones (cost, adverse event profile, administration route, dosing, etc) that justify a NIM consistent with slightly less efficacy.

Non-inferiority trials have been proposed and performed in the context of clinical efficacy (‘signs and symptoms’), and function reasonably well, as long as there is a measurable and clinically relevant effect observed in both groups. Since the usual RCT in RA and PsA includes patients with a high level of disease activity, and the aim of treating patients with high disease activity is to achieve the minimum possible level of disease activity (clinical remission), the claim of non-inferiority makes sense as long as there is a sufficiently large and expected effect observed in the control arm (‘assay sensitivity’). Non-inferiority trials for inhibition of radiographic progression become complicated, however, because of the small treatment differences currently evident in RCTs with a superiority design (eg, <1 total Sharp-units). NIMs that appropriately reflect these low effect sizes will hugely increase the sample sizes of such RCTs, rendering them unfeasible.
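The effect of a small margin on sample size can be illustrated with the standard normal-approximation formula for a non-inferiority comparison of two means, n per arm = 2σ²(z₁₋α + z₁₋β)²/Δ², which assumes that the true difference between arms is zero. The SD of 5 Sharp-units, the margins, the one-sided α of 0.025 and the power of 90% used below are hypothetical values chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def patients_per_arm(sd, margin, alpha=0.025, power=0.90):
    """Approximate patients per arm for a non-inferiority test of two means.

    Normal approximation, one-sided alpha, true between-arm difference assumed zero.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (sd * (z_alpha + z_beta) / margin) ** 2)


# Hypothetical SD of change scores (5 Sharp-units) and three candidate margins.
for margin in (2.0, 1.0, 0.5):
    print(f"margin {margin} Sharp-units: {patients_per_arm(5.0, margin)} patients per arm")
```

Because the margin enters the formula squared, halving the margin roughly quadruples the required number of patients.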

From inhibition of radiographic progression towards maintaining structural integrity

As argued above, it is unlikely that adaptations to the classic superiority trial-design will ‘save’ radiographic progression as a realistic endpoint in future RCTs in RA and PsA. At the same time, it has become clear that future RCTs in RA and PsA will increasingly exploit non-inferiority designs, which complicates pursuing the endpoint of inhibition of radiographic progression, as argued.

Obviously the main aim of treating patients with chronic inflammatory diseases like RA and PsA, diseases with the intrinsic capability of destroying cartilage, bone and joint integrity, is to avoid this destruction, and to preserve structural integrity. While it has been demonstrated that radiographic progression largely occurs in joints with clinical inflammation,3 it is also true that part of progression can be inhibited by drugs independently of inhibition of inflammation.8

An appropriate DMARD should therefore be able to:

  1. Inhibit the inflammatory process (alleviating ‘signs and symptoms’)

  2. Prevent the (further) occurrence of structural damage (in part by reducing inflammation, or otherwise by directly influencing the destructive process).

  3. Improve and preserve physical function (driven by inflammation and structural damage)

Until 14 years ago, the best conventional DMARDs could do was to alleviate signs and symptoms and inhibit an already ongoing process of destruction that at that time was considered unstoppable. After the introduction of TNF inhibitors and non-TNF biological DMARDs, it has become increasingly clear that progression of destruction can indeed be halted almost completely, or at least reduced to very low, probably clinically irrelevant, levels. The concept of maintaining structural integrity has become a testable concept, while the concept of inhibiting radiographic progression in comparison to placebo or any control treatment appears untenable in future.

In the following paragraphs, we propose a methodology to test the concept of maintaining structural integrity as a framework to support labelling for new DMARDs. This includes conventional radiography and scoring methods, elaborates in part on the principles of non-inferiority trial designs, and makes use of achievements from past RCTs including enriched trial populations. Importantly, the framework is valid in the context of RCTs with a superiority design and a non-inferiority design!

A framework to test the concept of structural integrity in future clinical trials in RA and PsA

The concept of maintaining structural integrity is operationalised as a radiographic progression score equal to zero (no progression). Because of measurement error and natural variability, it is not feasible to declare maintenance of structural integrity only when the group progression score is exactly zero. In statistical terms, therefore, investigating zero progression implies investigation of the margins of uncertainty around a mean progression score of zero (or slightly above zero). A future RCT that is designed to investigate if a new treatment has the capability of maintaining structural integrity needs to fulfil three important requirements:

(1) it should have assay validity; (2) it should have assay integrity; and (3) it should apply appropriate hypothesis testing. These requirements are discussed below.

Assay validity

It is critical that such an RCT includes a population with sufficient likelihood of showing radiographic progression at the group level, in terms of intrinsic patient characteristics and duration of follow-up. Such a population should be enriched for the potential of further radiographic damage. Criteria for this can be based on historical data including all RCTs comparing previously approved TNF inhibitors or other biologicals against background methotrexate treatment (incomplete MTX-responder trials), on the basis of which a claim for ‘inhibition of radiographic progression’ for many of these biological agents has been awarded. Inclusion criteria and population characteristics at baseline should reflect those from these prior RCTs.

For RA this would imply: criteria for disease activity at entry, rheumatoid factor positivity and/or anticitrullinated antibodies positivity and/or erosions present at baseline. For PsA this would imply: criteria for disease activity and for erosive disease. Follow-up duration of such a future trial to test ‘maintenance of structural integrity’ should be a minimum of 6–12 months.

Assay integrity

Radiographs of hands and feet, performed at baseline and endpoint in all patients, will be scored pairwise (baseline and follow-up) using conventional scoring methods, by at least two independent readers. The precision of scoring can be improved by adding readers,9 which limits variability and may increase statistical power. It is critical that readers are not able to ascertain the true time order of the radiographs. To minimise the (theoretical) possibility of spuriously scoring towards zero (which may happen if a reader realises that the trial includes treatments with expected progression scores of zero), the imaging set should include a sufficient number of dummy images (eg, 50). These are historically obtained sets of radiographs from patients in whom change was demonstrated, and the readers should demonstrate that they detect these changes. These dummy sets lend credibility to the accuracy of the readings and increase the likelihood of an unbiased score. This safeguard is derived from the design of non-inferiority trials, in which it has been argued that ‘no change’ works in favour of declaring non-inferiority.

For related reasons, linear extrapolation of scores from patients discontinuing early can be applied rather safely in this trial design. Linear extrapolation may lead to an inadvertently high progression signal in an untreated (placebo) group (with relatively more patients withdrawing early). In a superiority design, such a spuriously high signal in the control group may lead to a statistical advantage for the active treatment group. In the proposed scenario this is of no particular concern: if linear extrapolation of patients withdrawing early leads to the imputation of scores >0, it moves the mean group progression rate away from zero, and thus adds to conservatism.

Hypothesis testing

Hypothesis testing based on between-treatment comparisons will NOT be performed with regard to maintenance of structural integrity. The null hypothesis to be tested is, in contrast to the null hypothesis in classic superiority designs, that the new treatment will NOT maintain structural integrity (ie, will show a change >0 in total Sharp-units). The statistical procedure will be a parametric test for paired observations (eg, paired t test) comparing baseline scores and follow-up scores. Since the alternative hypothesis includes only one scenario (‘the new treatment preserves structural integrity’, or: ‘Change=0 Sharp-units’), one-sided testing should be applied. This contrasts with a superiority design, in which there are two possible alternative scenarios and two-sided testing is required.

The null hypothesis will be rejected if the upper border of the 95% CI surrounding the mean within-group change does NOT exceed the predefined ‘structural integrity margin (SIM)’, and the predicate of ‘maintaining structural integrity’ will be declared.
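The decision rule can be sketched as follows, under several assumptions that the text leaves open: the upper confidence bound is computed with a normal approximation (for small samples a t-based bound, as implied by a paired t test, would be somewhat wider), the bound is one-sided at the 95% level by default, and the change scores and SIM value are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def maintains_structural_integrity(changes, sim, confidence=0.95, one_sided=True):
    """Compare the upper confidence bound of the mean within-group change with the SIM.

    changes: per-patient change scores (follow-up minus baseline).
    Returns (declared, mean_change, upper_bound).
    """
    mean_change = mean(changes)
    standard_error = stdev(changes) / sqrt(len(changes))
    tail = (1 - confidence) if one_sided else (1 - confidence) / 2
    z = NormalDist().inv_cdf(1 - tail)  # normal approximation, not a t quantile
    upper_bound = mean_change + z * standard_error
    # 'Maintaining structural integrity' is declared if the bound does not exceed the SIM.
    return upper_bound <= sim, mean_change, upper_bound


# Hypothetical within-group change scores and a hypothetical SIM of 0.5 Sharp-units.
changes = [0.0, 0.5, -0.5, 1.0, -1.0, 0.0, 0.5, -0.5]
declared, mean_change, upper_bound = maintains_structural_integrity(changes, sim=0.5)
print(declared, round(mean_change, 2), round(upper_bound, 2))
```

Passing `one_sided=False` gives the upper limit of a conventional two-sided 95% CI instead, which is the more conservative reading of the rule.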

It is critical to define an appropriate ‘SIM’. The SIM conceptually mimics the NIM in non-inferiority trial designs. There are, however, important differences: The NIM is the resultant of considerations regarding many aspects of the new treatment in relation to the control arm, and may include weighted arguments of adverse events, cost, feasibility and others.

The SIM is based only on statistical arguments from historical data. The same set of RCTs as used under #1 (assay validity) would be used to obtain a mean estimate of radiographic progression under biological treatment (eg, at 6 months or at 12 months) under full intention-to-treat circumstances (ie, including all cases obtained by imputation). We propose to use this historical mean progression score as the SIM, because it is an appropriate historical estimate of what could be expected from the average biological treatment as of 2012. Although in theory this SIM does not preclude all radiographic progression in every individual patient, it does provide sufficient certainty that the progression score belonging to a new therapy declared to maintain structural integrity will be equal to or lower than the expected progression under currently available best (standard) treatments. Figure 1 visualises possible scenarios, as well as the conclusions based on them with regard to ‘maintaining structural integrity’.

Figure 1

Different scenarios that explain how radiographic progression over time in a group of patients with rheumatoid arthritis or psoriatic arthritis may lead to declaring maintenance of structural integrity or not. Dots reflect mean progression score and error bars reflect 95% CIs around mean progression score.

Conclusion

In this article we have proposed the concept of maintaining structural integrity as a replacement for the regulatory paradigm of inhibiting radiographic progression. Maintaining structural integrity implies a different way of thinking, which has repercussions for trial design and analyses. The framework proposed in this article preserves the recent achievements of clinical development in the field of rheumatology, and assures that adaptations to existing trial designs are conservative and based on historical information, including observed radiographic progression rates under optimal treatment conditions. Importantly, this framework can be applied in classic superiority as well as non-inferiority trial designs, which is an advantage in the currently rapidly changing clinical development environment in rheumatology. Although further study and analysis should be done to implement this framework, it can relatively easily be applied, even in ongoing RCTs.

References

Footnotes

  • Contributors All authors have equally contributed in the discussions about the concept of the work. RL has written the manuscript, VS and DvdH have critically reviewed the manuscript, and all three authors have agreed to authorship.

  • Competing interests None.

  • Provenance and peer review Not commissioned; externally peer reviewed.