Abstract
Objectives In axial spondyloarthritis (axSpA), early diagnosis is crucial, but diagnostic delay remains long and diagnostic criteria do not exist. We aimed to identify a diagnostic model that distinguishes patients with axSpA from patients without axSpA with chronic back pain based on clinical data in routine care.
Methods Clinical data from patients with chronic back pain were used, with information on rheumatological examinations based on clinical indications. The total dataset was randomly divided into training and test datasets at a 7:3 ratio. A machine learning-based model was built to distinguish axSpA from non-axSpA using the random forest algorithm. Overall accuracy, sensitivity, specificity and the area under the receiver operating characteristic curve (ROC-AUC) were calculated in the test dataset. The contribution of each variable to the accuracy of the model was assessed.
Results Data from 939 randomly selected patients were available: 659 diagnosed with axSpA and 280 with non-axSpA. In the test dataset, the model reached an accuracy of 0.9234, a sensitivity of 0.9586, a specificity of 0.8438 and a ROC-AUC of 0.9717. Human leucocyte antigen B27 (HLA-B27) contributed most to the accuracy of the model; that is, the accuracy would suffer most from not using HLA-B27, followed by insidious onset of back pain and erosions in the sacroiliac joint.
Conclusions We provide a machine learning-based model that shows high performance in distinguishing patients with chronic back pain with axSpA from those without axSpA, based on information from a tertiary rheumatology practice. This model has the potential to reduce diagnostic delay in patients with axSpA in daily routine settings.
- Axial Spondyloarthritis
- Machine Learning
- Epidemiology
Data availability statement
Data are available upon reasonable request.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
The diagnostic delay in axial spondyloarthritis (axSpA), averaging 6.8 years (95% CI 6.2 to 7.3) across 54 axSpA studies worldwide, is one of the longest within the field of rheumatology and understanding the diagnostic approach represents an unmet need.
WHAT THIS STUDY ADDS
We provide a machine learning-based diagnostic model with high performance in distinguishing between axSpA and non-axSpA in patients with chronic back pain based on evaluations made in the daily practice of a specialised clinic.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
The diagnostic model aims to assist rheumatologists in diagnosing axSpA. The clinical applicability as a diagnostic support tool for axSpA can be validated in further studies with external datasets.
Introduction
Axial spondyloarthritis (axSpA) represents a diagnostic challenge due to the heterogeneity of clinical symptoms against a high prevalence of non-specific back pain. Specifically, axSpA is characterised by non-specific symptoms (eg, chronic back pain) that are generally common and usually explained by other conditions, such as degenerative disc disease or functional disorders.1–3 Years of delays in the diagnosis of axSpA are common, with a pooled mean delay to diagnosis of 6.8 years (95% CI 6.2 to 7.3) across 54 axSpA studies worldwide.4 These delays hinder early treatment, which is of utmost importance to decrease the burden of disease and increase health-related quality of life.5
AxSpA primarily affects the sacroiliac joints (SIJs) and spine,6 encompassing both radiographic axSpA, characterised by definite radiographic sacroiliitis, and non-radiographic axSpA, in which such radiographic evidence is absent.7 8 Structural changes in the SIJs suggestive of sacroiliitis include erosions, fat lesions, ankylosis and sclerosis, whereas active inflammatory changes include bone marrow oedema. Both active and structural changes can be assessed with MRI.
In daily practice, axSpA is diagnosed based on a range of features, including symptoms reported in the patient’s history, physical examination findings, laboratory test results and imaging data. To assist rheumatologists in diagnosing axSpA, clinical algorithms have been developed that consider known clinical features, imaging findings and laboratory test results. Several referral strategies have been published that may help identify patients in primary care and refer them to a rheumatologist.9–12 The Assessment of Spondyloarthritis International Society (ASAS) has also contributed an algorithm for earlier identification of the disease.13 According to this algorithm, in patients with chronic low back pain and young age at symptom onset, the presence of four or more spondyloarthritis (SpA) features is suggestive of SpA in the absence of advanced sacroiliitis on X-ray (in its presence, ankylosing spondylitis is suggested). For patients with three or fewer SpA features, further workup with human leucocyte antigen B27 (HLA-B27) testing and MRI for the detection of sacroiliitis is considered necessary to reach a diagnosis. The ASAS algorithm showed sensitivities of 77.9% to 79.7% and specificities of 78.3% to 80.4%,13 using the diagnosis by a rheumatologist, the best available reference standard. Although its overall performance was considered satisfactory, it still missed approximately 20% of patients with axSpA.
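The first step of the ASAS approach, as summarised above, can be sketched as a small decision function. This is a simplified, hypothetical illustration: `asas_first_step` and its parameters are our own names, the definition of the individual SpA features and of "young age at onset" is left to the caller, and the subsequent HLA-B27/MRI workup is only flagged, not encoded.

```python
from typing import Optional


def asas_first_step(n_spa_features: int,
                    chronic_low_back_pain: bool,
                    young_onset: bool,
                    advanced_xray_sacroiliitis: Optional[bool] = None) -> str:
    """Simplified first step of the ASAS approach as described in the text.

    Hypothetical helper: which SpA features count, the age cut-off for
    "young onset" and the HLA-B27/MRI workup itself are not encoded here.
    """
    if not (chronic_low_back_pain and young_onset):
        return "outside the algorithm's target population"
    if advanced_xray_sacroiliitis:
        # Advanced X-ray sacroiliitis points towards ankylosing spondylitis.
        return "ankylosing spondylitis suggested"
    if n_spa_features >= 4:
        return "SpA suggested"
    # Three or fewer SpA features: no decision without further workup.
    return "workup: HLA-B27 and MRI of the SIJs"


print(asas_first_step(4, True, True))  # SpA suggested
print(asas_first_step(2, True, True))  # workup: HLA-B27 and MRI of the SIJs
```

Leaving `advanced_xray_sacroiliitis` as `None` mirrors the variant evaluated in this study, in which information on X-ray sacroiliitis was not used.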
Objectives
This study aimed to develop a machine learning-based diagnostic model for axSpA based on the diagnostic approach applied in daily practice by experienced rheumatologists specialised in axSpA, incorporating clinical, laboratory and imaging characteristics. The contribution of each characteristic to the model’s accuracy in differentiating between axSpA and non-axSpA was evaluated; the term non-axSpA refers to cases in which axSpA was ruled out.
Methods
Clinical data of patients who presented with chronic back pain to a specialised tertiary rheumatology centre in Germany were retrospectively retrieved from the hospital information system, including the clinical, laboratory and imaging findings from the rheumatological examination that all patients had received. Clinical data comprised symptom duration, onset of low back pain (insidious vs acute), morning stiffness of the back, awakening in the second half of the night due to low back pain, good response of back pain to non-steroidal anti-inflammatory drugs (NSAIDs), improvement of low back pain with exercise, presence of arthritis, uveitis, dactylitis, psoriasis, Crohn’s disease/colitis (all current or previous) and family history of axSpA or other associated rheumatic/autoimmune conditions. Laboratory data included HLA-B27 status and C reactive protein (CRP) level. Imaging data comprised bone marrow oedema, erosion, sclerosis, fat metaplasia and ankylosis, all on MRI of the SIJs, as described in the radiologists’ reports. Two radiologists, trained in musculoskeletal imaging and with expertise in spondyloarthritis, produced the imaging reports for all patients. The extent of lesions was not quantified, since no imaging score is applied in routine care. However, all reports stated qualitatively whether lesions suggestive of SpA were present; lesions not considered SpA-like are routinely not taken into account for the detection of SpA in our clinic. The imaging information used as model input was taken from these qualitative/descriptive reports.
The diagnosis (axSpA vs non-axSpA) was made in consensus by two rheumatologists, both with >6 years of clinical experience, and used as the reference standard.
Statistical analysis
Descriptive analyses were conducted using median and IQR for quantitative variables and absolute frequencies and percentages (%) for qualitative variables.
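Medians and IQRs depend on the quantile definition used. A minimal sketch with Python’s standard library, assuming linear interpolation between order statistics (`method="inclusive"`, which corresponds to R’s default quantile type 7; the paper does not state which definition was used). The ages are made up for illustration and are not study data.

```python
import statistics


def median_iqr(values):
    """Return (median, first quartile, third quartile).

    method="inclusive" interpolates linearly between order statistics,
    treating the minimum as the 0th and the maximum as the 100th percentile.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return median, q1, q3


ages = [22, 31, 35, 39, 44, 52]  # hypothetical ages, not study data
print(median_iqr(ages))  # (37.0, 32.0, 42.75)
```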
The total dataset was divided into two parts using stratified random sampling to ensure similar proportions of patients with and without axSpA in both subsets. The first subset, referred to as training data, contained 70% of the total dataset and was used to build a model that distinguishes between patients with and without axSpA. The second subset, referred to as test data, contained the remaining 30% and was used to assess the performance of the developed model; it played no role in building the model.
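A stratified 70/30 split can be sketched as sampling separately within each diagnostic class. The function below is an illustrative sketch; rounding the per-class training share up is our assumption and is not stated in the paper, but with the 485 axSpA and 216 non-axSpA complete cases it reproduces the reported split of 492 training and 209 test patients (340 + 152 in training).

```python
import math
import random
from collections import defaultdict
from fractions import Fraction


def stratified_split(labels, train_frac=Fraction(7, 10), seed=1):
    """Return (train_idx, test_idx), sampling separately within each class.

    Fraction keeps the per-class arithmetic exact; rounding the training
    share up within each class is an assumption of this sketch.
    """
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = math.ceil(len(indices) * train_frac)  # round the training share up
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


# Class sizes of the complete-case data (485 axSpA, 216 non-axSpA).
labels = ["axSpA"] * 485 + ["non-axSpA"] * 216
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 492 209
```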
The model was built using the random forest algorithm, an ensemble machine learning technique that combines decision trees and is based on an extended bagging algorithm. Its advantages include efficient model building and evaluation, robustness to outliers, the ability to capture complex non-linear relationships, and its handling of class imbalance, overfitting and small sample sizes.14–17
In the main analysis, only complete observations were analysed. Clinical SpA features, laboratory results and imaging characteristics were used as input variables. The mean decrease in accuracy (MDA) measure was used to assess the contribution of each variable to the model’s accuracy in differentiating between axSpA and non-axSpA.18 For each variable, the MDA measures how much the model’s accuracy decreases when the information in that variable is not used, that is, when its values are randomly permuted. The hyperparameters of the random forest algorithm were the number of trees in the forest, set to 500, and the number of variables randomly sampled at each node as candidates for the best split, set to the square root of the total number of input variables. Different hyperparameters were investigated in sensitivity analyses. Quantitative variables were normalised to a range of 0–1. The model’s performance was evaluated in the test dataset by overall accuracy, sensitivity, specificity, F1 score and the area under the receiver operating characteristic curve (ROC-AUC). In addition, the performance of the ASAS algorithm (without information on X-ray sacroiliitis) was evaluated in the test dataset and compared with that of the machine learning-based model.
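The MDA idea can be illustrated with a stdlib-only sketch of permutation importance: shuffle one variable, re-score the model and record the drop in accuracy. R’s randomForest computes MDA tree by tree on each tree’s out-of-bag cases, whereas this sketch permutes columns of a held-out set and accepts any classifier, so it is analogous rather than identical. The toy data and `toy_model` are hypothetical stand-ins. (For reference, the stated hyperparameters would correspond to scikit-learn’s `RandomForestClassifier(n_estimators=500, max_features="sqrt")`.)

```python
import random
from typing import Callable, List, Sequence

Row = List[float]


def accuracy(predict: Callable[[Row], int], X: Sequence[Row], y: Sequence[int]) -> float:
    """Share of rows whose predicted class equals the reference label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def mean_decrease_accuracy(predict, X, y, n_repeats=10, seed=0):
    """Permutation importance: average accuracy drop after shuffling one column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between variable j and the label
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: column 0 (think "HLA-B27", hypothetical) determines the label, column 1 is noise.
X = [[1.0, 0.3], [0.0, 0.7], [1.0, 0.9], [0.0, 0.1]] * 5
y = [int(row[0] == 1.0) for row in X]
toy_model = lambda row: int(row[0] == 1.0)  # stand-in for a fitted random forest
print(mean_decrease_accuracy(toy_model, X, y))
```

The noise column gets an importance of exactly zero because the toy model ignores it, while shuffling the informative column costs roughly half of the accuracy.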
Sensitivity analysis
The model building, which in the main analysis used only complete observations, was repeated on imputed data to include the total study population. Missing data were imputed within the random forest algorithm by an iterative procedure using proximities between complete observations.18 As in the complete case analysis, 70% of the data were used as training data and 30% to test the model’s performance. Imputation was performed after the dataset was split.
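The proximity-based imputation can be sketched as follows. In Breiman’s description, missing values are first roughly filled (median or most frequent value), a forest is grown, and each missing value is then replaced by a proximity-weighted average of the observed values (numeric variables) or by the most proximity-weighted category (categorical variables); this is repeated for a few iterations. The sketch shows one such update for a single variable, assuming a precomputed proximity matrix; growing the forest and computing proximities are omitted, `proximity_update` is a hypothetical name, and R’s rfImpute may differ in details.

```python
from typing import List, Sequence


def proximity_update(values: Sequence, missing: Sequence[bool],
                     proximity: Sequence[Sequence[float]],
                     categorical: bool = False) -> List:
    """One proximity-weighted update of the missing entries of a single variable.

    values:    current values for all patients (missing entries already rough-filled)
    missing:   True where the original value was missing
    proximity: proximity[i][k] = share of trees in which patients i and k share a leaf
    """
    observed = [k for k, is_missing in enumerate(missing) if not is_missing]
    updated = list(values)
    for i, is_missing in enumerate(missing):
        if not is_missing:
            continue
        weights = [(k, proximity[i][k]) for k in observed]
        total = sum(w for _, w in weights)
        if total == 0:
            continue  # no shared leaves with observed patients: keep the rough fill
        if categorical:
            score = {}
            for k, w in weights:
                score[values[k]] = score.get(values[k], 0.0) + w
            updated[i] = max(score, key=score.get)  # most proximity-weighted category
        else:
            updated[i] = sum(w * values[k] for k, w in weights) / total
    return updated


# Toy example with three patients; patient 2 has a missing numeric value.
prox = [[1.0, 0.2, 0.75],
        [0.2, 1.0, 0.25],
        [0.75, 0.25, 1.0]]
print(proximity_update([1.0, 3.0, 2.0], [False, False, True], prox))  # [1.0, 3.0, 1.5]
```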
Subanalyses
In a first subanalysis, data were initially analysed using clinical and laboratory parameters only, without MRI information, to identify which patients may benefit most from imaging. Patients with an axSpA probability of at least 80% based on clinical and laboratory parameters alone were categorised as having axSpA without the need for MRI information; for the remaining patients, MRI data were subsequently added and the model from the main analysis was applied. This two-step approach is particularly relevant in regions where long MRI waiting times are common.
The second subanalysis was similar to the first but excluded HLA-B27 information at both steps, since HLA-B27 positivity is considerably more prevalent in white patients with axSpA than in the general population. Patients with an axSpA probability of 85% or higher in the first step were categorised as having axSpA without the need for MRI information (a slightly higher threshold than in the first subanalysis was used because one more variable was excluded); for the remaining patients, MRI data were subsequently used.
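Both subanalyses follow the same two-step logic and differ only in the first-step variables and the probability threshold (0.80 with HLA-B27, 0.85 without). A minimal sketch, assuming the two fitted models are available as callables; `two_step_diagnosis`, the patient dictionary keys and the stand-in models are hypothetical.

```python
from typing import Any, Callable, Mapping, Tuple

Patient = Mapping[str, Any]


def two_step_diagnosis(patient: Patient,
                       step1_probability: Callable[[Patient], float],
                       full_model: Callable[[Patient], str],
                       threshold: float = 0.80) -> Tuple[str, bool]:
    """Return (assigned label, whether MRI information was needed)."""
    # Step 1: model without MRI variables (and, in subanalysis 2, without HLA-B27).
    if step1_probability(patient) >= threshold:
        return "axSpA", False
    # Step 2: add MRI findings and defer to the model that uses them.
    return full_model(patient), True


# Hypothetical stand-in models, for illustration only.
step1 = lambda p: 0.9 if p["insidious_onset"] else 0.4
full = lambda p: "axSpA" if p["mri_erosion"] else "non-axSpA"
print(two_step_diagnosis({"insidious_onset": True, "mri_erosion": False}, step1, full))
# ('axSpA', False)
print(two_step_diagnosis({"insidious_onset": False, "mri_erosion": True}, step1, full, threshold=0.85))
# ('axSpA', True)
```

Counting how often the second element is `True` over a test set gives the share of patients for whom MRI was still needed.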
All analyses were performed in R V.4.1.2.
Results
Data from 939 patients were identified in the hospital information system and evaluated retrospectively: 659 patients (70%) with axSpA and 280 (30%) without axSpA. Compared with patients without axSpA, patients with axSpA were more often male (63% vs 48%), were older (median age 39 vs 35 years) and had a longer median symptom duration (2 years vs 1 year). HLA-B27 positivity was more frequent in patients with axSpA than in those without (69% vs 14%), as was CRP elevation (57% vs 15%).
Except for awakening in the second half of the night due to back pain, characteristics of inflammatory back pain were more often present in patients with axSpA than in patients without axSpA, as were imaging findings (table 1).
Building the machine learning-based model using training data
A diagnostic model that categorised patients as axSpA or non-axSpA was trained using all characteristics from table 1 as input variables for the random forest algorithm, except for symptom duration, which was excluded because about 30% of its values were missing. Complete data for the remaining variables were available for 701 patients, including 485 patients (69%) with axSpA and 216 (31%) without axSpA. Characteristics of these patients were comparable to those of the total population of 939 patients (online supplemental table 1). Of the 701 patients, 492 were included in the training data and 209 in the test data, with proportions of 69% axSpA and 31% non-axSpA in both datasets.
Performance assessment of the machine learning-based model using test data
Based on the test data, the machine learning-based model reached an accuracy of 0.9234, a sensitivity of 0.9586, a specificity of 0.8438, an F1 score of 0.9456 and a ROC-AUC of 0.9717 (figure 1). The ASAS algorithm yielded an accuracy of 0.8086, a sensitivity of 0.8276, a specificity of 0.7656 and an F1 score of 0.8571 in the test data. The machine learning-based model outperformed the ASAS algorithm in all of these measures. The characteristics of patients correctly categorised by the machine learning-based model but not by the ASAS algorithm are depicted in online supplemental figure 1 for patients with axSpA and online supplemental figure 2 for patients without axSpA, respectively.
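The reported test-set metrics can be reproduced from a 2×2 table. The counts below are not given in the paper; they are our back-calculation, assuming 145 patients with and 64 without axSpA among the 209 test patients (69% of 209, and the only composition consistent with the reported sensitivity). With these counts, the sketch reproduces every reported accuracy, sensitivity, specificity and F1 value for both the random forest model and the ASAS algorithm to four decimals.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity, specificity and F1 score from a 2x2 table."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),       # share of axSpA patients detected
        "specificity": tn / (tn + fp),       # share of non-axSpA patients correctly excluded
        "f1": 2 * tp / (2 * tp + fp + fn),   # harmonic mean of precision and sensitivity
    }


# Back-calculated counts (tp, fn, tn, fp), not reported in the paper.
tables = {"random forest": (139, 6, 54, 10), "ASAS": (120, 25, 49, 15)}
for name, counts in tables.items():
    m = diagnostic_metrics(*counts)
    print(name, {k: f"{v:.4f}" for k, v in m.items()})
```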
Contribution of each variable to the accuracy of the machine learning-based model
The contribution of each variable to the accuracy of the model in differentiating between axSpA and non-axSpA is given in table 2 and visualised in figure 2. HLA-B27 was ranked highest (0.055), which means the accuracy of the model would suffer most from not using HLA-B27 in the model, followed by insidious onset of low back pain (0.038) and erosions on MRI of the SIJ (0.037). Elevated CRP came next in the ranking (0.018), preceding fat metaplasia on the SIJ MRI, arthritis, ankylosis on the SIJ MRI, awakening in the second half of the night due to low back pain, a positive family history for axSpA, a good NSAID response according to the patient’s opinion, bone marrow oedema of the SIJ on MRI, uveitis and improvement of low back pain with exercise (all >0.005). Other variables, including age and sex, contributed less to the accuracy of the model, with an MDA of under 0.005 (table 2).
Table 3 shows the characteristics of patients in the test dataset, stratified by the rheumatologists’ diagnosis and by the model’s assignment; figure 3 visualises them as a radar chart. In general, patients without axSpA whom the model wrongly assigned as having axSpA presented similarly to patients with axSpA whom the model correctly identified. Likewise, patients with axSpA whom the model wrongly assigned as not having axSpA presented similarly to patients without axSpA whom the model correctly identified.
Sensitivity analysis
The analysis was repeated on imputed data to include the total population of 939 patients. A total of 657 and 282 patients were included in the training and test data, respectively, with proportions of 70% axSpA and 30% non-axSpA in both datasets. This model reached an accuracy of 0.8936, a sensitivity of 0.9545, a specificity of 0.7500, an F1 score of 0.9264 and a ROC-AUC of 0.9653. The contribution of each variable to the model’s accuracy, measured by MDA, was similar to that in the complete case analysis, with HLA-B27 ranking highest (0.044), followed by insidious onset of low back pain (0.033), erosion (0.032) and elevated CRP (0.027). Different hyperparameters were also investigated and led to similar results (data not shown).
Subanalyses
First subanalysis
In the test data, 116 of 209 patients were categorised as having axSpA after step 1 (without MRI information), while the remaining 93 were assessed at step 2 (additionally using MRI information). This two-step approach achieved an accuracy of 92.34%, a sensitivity of 95.86%, a specificity of 84.38%, an F1 score of 94.56% and a ROC-AUC of 95.64%.
Second subanalysis (without HLA-B27 information)
In the test data, 99 of 209 patients were categorised as having axSpA after step 1 (without using MRI information), while the remaining 110 were assessed at step 2 (additionally using MRI information). This approach yielded an accuracy of 93.30%, a sensitivity of 97.24%, a specificity of 84.38%, an F1 score of 95.27% and a ROC-AUC of 94.08%.
Both two-step approaches performed comparably to the main model while avoiding the need for MRI data in approximately half of the cases and, in the second subanalysis, avoiding HLA-B27 information entirely.
Discussion
In this study, we present a machine learning-based diagnostic model that differentiates between axSpA and non-axSpA in patients with chronic back pain. The model was built using clinical, laboratory and imaging characteristics as evaluated in the daily practice of a SpA-specialised clinic. It shows high performance, with a ROC-AUC of 0.9717 in the test data. The variables that contributed most to the accuracy of the model, in descending order, were HLA-B27, insidious onset of low back pain, erosion and elevated CRP.
Apart from insidious onset of back pain, which is patient-reported, the variables contributing most to the model’s accuracy were objective, such as laboratory or imaging characteristics. HLA-B27 contributed most to the model’s accuracy. A positive HLA-B27 test alone is not sufficient to confirm a diagnosis of axSpA, and a negative result does not rule it out; nonetheless, HLA-B27 testing can be beneficial in the diagnostic workup of patients presenting to a specialised clinic with chronic back pain. HLA-B27 plays a central role in the ASAS algorithm and has been shown to have high likelihood ratio estimates in the Caucasian population.9 19
The second-most contributing variable in our model was the insidious onset of low back pain. It is a symptom that belongs to the definition of inflammatory back pain, the presence of which has been shown to increase the probability of axSpA from 5%20 to up to 30%21 among patients with chronic back pain. Despite the low specificity of inflammatory back pain with a range of 25.1%–43.9%,22 making its presence alone not sufficient for the diagnosis of axSpA, its relatively high sensitivity ranging from 74.4% to 81.1%22 makes it useful in the diagnostic approach for axSpA among patients with chronic back pain. Besides contributing to inflammatory back pain, the presence of insidious onset helps to distinguish axSpA from other conditions that have a more abrupt or acute onset.
The third variable in order of contribution to the model’s accuracy was erosions as seen on MRI examinations of the SIJ. A recent study has shown that the number of SIJ quadrants affected by erosion is significantly higher in patients with axSpA than in patients without axSpA.23 Furthermore, in cases of osteitis condensans ilii, a condition that may manifest with symptoms resembling axSpA, bone marrow oedema can be observed on MRI of the SIJ, while erosions remain absent.24
The main limitation of our study is the uncertain generalisability of the results to settings outside a specialised rheumatological clinic. The pretest probability of an axSpA diagnosis in a specialised clinic is considerably higher than in private practice, especially general practice, but also than in a rheumatological or orthopaedic practice. This increased pretest probability may lead to selection bias among the patients included in the datasets used here, which in turn influences the accuracy of the model. The high pretest probability for axSpA in our population is also reflected in the higher proportion of patients with the diagnosis than without it, whereas the opposite would be expected in a primary or secondary care setting (as mentioned above, the prevalence of axSpA among patients with chronic back pain has been estimated at around 5%20). In addition, image interpretation is an important issue that deserves particular attention in such analyses. In our study, the reports of experienced radiologists were used. This might introduce bias owing to the expertise of radiologists in a specialised centre. On the other hand, it also reflects routine care, in which expertise in interpreting images for the clinical question at hand is required. Overall, since we believe that the diagnosis of axSpA should be made by specialists (in this case, an experienced rheumatologist) regardless of the clinical setting, the algorithm provided here can, under these conditions, be considered very useful outside a specialised hospital as well, in cases where axSpA is suspected but not obvious.
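The effect of pretest probability can be made concrete with Bayes’ theorem. The sketch below assumes, purely for illustration, that the test-set sensitivity (0.9586) and specificity (0.8438) would carry over unchanged to other settings, which is itself doubtful; it compares the positive predictive value (PPV) at the test-set prevalence (about 69%, assuming 145 of 209 test patients had axSpA) with the roughly 5% prevalence cited above.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """P(axSpA | model assigns axSpA), via Bayes' theorem."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


SENS, SPEC = 0.9586, 0.8438  # test-set values of the random forest model
for prevalence in (145 / 209, 0.05):
    ppv = positive_predictive_value(SENS, SPEC, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.2f}")
# prevalence 69%: PPV 0.93
# prevalence 5%: PPV 0.24
```

Under these assumptions, roughly three of four patients assigned axSpA at 5% prevalence would be false positives, which illustrates why external validation in lower-prevalence settings matters.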
In addition, our findings may be less applicable to patients in geographical areas with a lower prevalence of HLA-B27 than is expected in a central European country such as Germany. Nonetheless, precisely because of such differences in prevalence, the model might also work well in these areas, given the importance of these parameters in these populations; this needs to be examined in future studies. Another limitation, to be addressed in future studies, is that the diagnostic model has not yet been validated using external data from a similar clinical setting.
Overall, we provide a machine learning-based model that performs very well in diagnosing or excluding axSpA in tertiary rheumatology practice and outperformed the ASAS algorithm in our test data. The next step is additional validation using external datasets to assess the model’s diagnostic capabilities in diverse daily routine settings, offering insights for future applications.
Ethics statements
Patient consent for publication
Ethics approval
This study involves human participants. The Ethical Committee of the Ruhr-Universität Bochum, Germany, approved the study (reference number 20-6939-andere Forschung erstvotierend). Participants gave informed consent to participate in the study before taking part.
Supplementary materials
This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.
Footnotes
IR and ST contributed equally.
Contributors IR: statistical analysis, interpretation of results and writing of the manuscript. ST: organisation of the project, interpretation of results and review of the manuscript. JE: data collection and review of the manuscript. UK, IA and PS: interpretation of results and review of the manuscript. XB: idea, organisation of the project, interpretation of results, writing of the manuscript and guarantor of the data. We developed a model using random forest, a machine-learning algorithm.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.