Genetics of RA susceptibility, what comes next?

Summary Genome-wide association studies (GWASs) have been used to great effect to identify genetic susceptibility loci for complex disease. A series of GWAS and meta-analyses have informed the discovery of over 100 loci for rheumatoid arthritis (RA). In common with findings in other autoimmune diseases the lead signals for the majority of these loci do not map to known gene sequences. In order to realise the benefit of investment in GWAS studies it is vital we determine how disease associated alleles function to influence disease processes. This is leading to rapid development in our knowledge as to the function of non-coding regions of the genome. Here we consider possible functional mechanisms for intergenic RA-associated variants which lie within lncRNA sequences.

Rheumatoid arthritis (RA) is a polygenic autoimmune disease that is characterised by chronic inflammation of the synovial joints, and associated with the presence of anticitrullinated protein antibodies (ACPA). For over 40 years it has been well established that there is a strong association between susceptibility to RA and human leucocyte antigen (HLA) DRB1 alleles that share a consensus sequence, known as the 'shared epitope'. 1 Spanning position 70-74 of the β subunit of the HLA DR molecule, this association accounts for approximately 40% of the genetic component of susceptibility of RA, and has recently been refined following imputation of amino acid sequences from high-density single nucleotide polymorphism (SNP) data. 2 It is only since the introduction of SNP-based genome-wide association studies (GWAS) in 2007 that an extensive list of additional genetic risk loci, have been associated with RA. 3 Currently, over 100 risk loci have been defined, explaining approximately half of the genetic variance in RA. 4 5 The challenge in this post-GWAS era is to interpret and translate these genetic findings, defining how they influence disease susceptibility and outcome. Signals from GWAS typically comprise large numbers of highly correlated SNPs defining regions that contain, on average, four genes, thus the remaining challenge is to identify for each locus the functional variant(s), the genes on which they act and the specific cell type in which dysregulation contributes to disease susceptibility.
In addition to providing insights into RA pathogenesis, understanding genetic susceptibility at this level has the potential to impact on both drug discovery and the optimal implementation of established therapies through a greater understanding of disease heterogeneity. Thus far, functional studies to further characterise implicated genetic regions in disease have focused almost exclusively on associated variants in protein-coding genes, however, these account for only a small minority of associated loci. Evidence suggests that for RA and many other complex diseases the majority of associated variants map outside protein coding regions.

INTERGENIC RA LOCI
In contrast to monogenic traits, an enrichment analysis performed using data from over 100 GWAS publications has shown that SNPs associated with complex traits are intergenic and do not map to protein sequences. 6 Traditionally, the functional implications of intergenic RA loci have been attributed to the nearest protein-coding gene known to have immunological relevance. This assumption, that genetic polymorphisms alter the expression of neighbouring genes by affecting cis-regulatory elements, such as enhancers and promoters, has been demonstrated to be misguided at a number of loci. For example, an association between the chromosome 16p13 region and a number of autoimmune diseases has been shown. This association was originally attributed to CLEC16A, with the majority of disease associated SNPs found within the transcribed region of this gene, especially within intron 19. CLEC16A is expressed in a variety of immune relevant cells and contains an immunoreceptor tyrosine-based activation motif. Functional studies have, however, revealed a long range chromatin interaction between intron 19 of CLEC16A and regulatory sequences that affect the expression of DEXI, found ∼140 kb from the SNPs. 7 The focus of research at this locus has now moved to DEXI as an autoimmune candidate gene. A similar example, where the SNPs in question are intergenic, rather than intronic, can be found at the chromosome 1p13 region, which is associated with low-density lipoprotein cholesterol and myocardial infarction. Here a disease associated SNP disrupts a transcription factor binding site found between CELSR2 and PSRC1, which inhibits the expression of SORT1, found ∼35 kb from the putative causal SNP. 8 Evidence is now emerging that the variants most strongly associated with autoimmune disease (lead SNPs) are enriched in areas of the chromosome that; are open and active, are defined by epigenetic marks, regulate gene transcription, and act in a cell-specific manner. Indeed, a recent study of 21 autoimmune diseases by Farh et al, 9 described an enrichment of putative causal SNPs in immune-cell enhancers. The majority (∼60%) of these SNPs, associated with RA and 20 other autoimmune diseases, map to enhancer elements that are active in lymphoblasts, lymphocytes and macrophages. The RA-associated enhancers are enriched for; cell specificity (T and B cells), activity in response to stimulation, and size (large 'super-enhancers'). Furthermore, they are associated with the expression of non-coding RNAs (enhancer-derived RNAs, eRNAs also known as elncRNAs). This was shown to be true within the subset of RA-associated super-enhancers in the analysis by Farh et al, but is best illustrated by Hnisz et al 10 in one of the first papers to describe super-enhancers. Super-enhancers, consist of clusters of enhancer elements that play a key role in defining cell identity and are associated with a 24.3-fold enrichment of RNA, compared to typical enhancers. Fahr et al found a 3.2-fold enrichment of RA SNPs in super-enhancers compared to typical enhancers and speculated that these SNPs could elicit subtle and stimulation specific changes in gene expression. Defining the function of these ncRNA expressing regions will be pivotal if we are to translate GWAS findings into mechanisms of susceptibility.
By recruiting various proteins, enhancers mediate chromosomal interactions, enabling the assembly of transcriptional machinery at promoter regions. It is well established that these regulatory elements do not necessarily affect the transcription of neighbouring genes, can function over large distances and generally affect the expression of more than one gene. 11 One of the hallmarks of these regulatory regions is accessible chromatin, as indicated by regions of DNase I hypersensitivity.
Other markers that offer greater specificity include chromatin modifications: Enhancer regions are associated with high levels of histone 3 lysine 4 monomethylation and dimethylation, relative to trimethylation (H3K4me1, H3K4me2 and H3K4me3, respectively), with active enhancers characterised by histone 3 lysine 27 acetylation (H3K27ac). 12 Lead disease-associated variants, situated as they are in gene regulatory regions, may have a number of functional consequences that contribute to increased risk of RA. These include disruption of transcription factor binding sites and chromatin interactions, as at the SORT1 and DEXI loci, as well as other mechanisms of disrupting regulatory features such as enhancers, silencers, promoters and insulators. Examples include potential impacts on the deposition of epigenetic markers and RNA binding. It is also possible for non-coding SNPs to affect the function of non-coding RNA transcripts, such as those of eRNA of lncRNA, by altering their sequence or expression pattern.
LNCRNA lncRNA are traditionally defined by an arbitrary lower limit of 200 bases, with over 100 000 transcripts identified in humans. 13 Their diverse functions and characteristics have only recently attracted the attention of the scientific community (see box 1 for a description of various lncRNA categories). Further classification of lncRNA has focused on genomic content, with over a third of transcripts found in intergenic regions (long intervening non-coding RNA, lincRNA). While advances in RNA sequencing (RNA Seq) techniques can account for many of the lncRNA species identified to date, a large number of lincRNA have been identified by having a similar chromatin signature to that of other actively transcribed genes. Typically, trimethylation of histone 3 lysine 4 is (H3K4me3) is observed at the promoter, with histone 3 lysine 36 trimethylation (H3K36me3) found along the transcribed length.
lincRNAs are typically transcribed by RNA Polymerase II and often undergo polyadenylation and splicing, just like protein-coding transcripts. 14 This is, however, not the case for some eRNAs, which are subdivided according to whether they are unidirectionally or bidirectionally transcribed (1D-eRNA, 2D-eRNA). eRNA are generally unspliced, with only the less common 1D-eRNAs undergoing polyadenylation.
In order to identify RA-associated loci where increased risk is likely to be mediated by lncRNA we have compiled a list of RA-associated regions that contain lincRNA and do not contain protein-coding genes. Regions of high linkage disequilibrium (encompassing all SNPs in linkage disequilibrium, r 2 =0.8 with the lead GWAS variant) were defined around the 102 regions associated with RA and computationally intersected with a database of lincRNA (using bedtools). 15 This was performed with all lncRNAs annotated as intergenic in the lncRBase database. 13 Of the 25 regions returned, 12 contained lncRNA that originate from Human Body Map RNA sequencing data, 16 with the provenance of the remaining lncRNAs generally being Expressed Sequence Tags (ESTs) found in the NONCODE database. 17 Manual curation, using UCSC genome browser, 18 resulted in a list of 9 lncRNA containing RA regions that are sufficiently distant from protein-coding transcripts as to make any direct influence on their expression an unlikely consequence of disease-associated variants. A list of these regions is provided in table 1.
Fully characterising these lncRNA and the impact RA-associated variants have on their function will be an important area of research aimed at elucidating mechanisms of disease susceptibility. Generalised functional models for lncRNA are difficult to impose due to the variety of mechanisms observed, however, most functional lncRNA seem to act as scaffolds, regulating the formation of complexes that can be composed of protein, DNA and other RNA. 14 20 In brief, some of the complexes that lncRNA have been implicated in include the mediator complex 21 and histone modifiers, such as Polycomb Repressive Complex 2 (PRC2). 22 Frequently, lncRNA are implicated in regulating the transcription of other genes, in cis and in trans. 23 Further complications include evidence that the act of transcription from certain lncRNA loci can be functional regardless of the transcript produced. 24 Evidence supporting a role of lincRNAs in autoimmune disease is beginning to accumulate as this field attracts the interest of the scientific community. This evidence includes observations of differential regulation of lincRNA in pathways relevant to RA. For example in CD14+ monocytes isolated from patients with RA and stimulated with tumour-necrosis factor-α or interleukin 6, two non-overlapping subsets of lincRNA were found to be differentially regulated (60 and 25 lincRNAs, respectively). 25 A separate study has linked changes in lincRNA expression to genotype using GWAS data, introducing the concept of lincRNA expression quantitative trait loci (lincRNA eQTLs). In general, these eQTLs were tissue specific and affected only lincRNAs and not a downstream regulated protein coding gene. 26 As for enhancers, these findings encourage a stimulus and tissue-specific approach to investigating the role of lncRNA in RA.

INVESTIGATING LNCRNA
Many online resources are available that enable the prioritisation of candidate lncRNAs based on publicly available data (table 2). These include tools to overlay disease-associated SNPs with lncRNA, investigate the provenance and expression of these lncRNAs and to examine other genomic features of these loci, such as methylation, chromatin modifications and transcription factor binding.
Once genetically-associated lncRNA are identified and investigated bioinformatically, a wide variety of molecular biology techniques can be used to interrogate their function. In many cases the techniques are similar to those used to investigate the transcripts of coding genes, however, several additional considerations exist. 34 It is important that experiments are designed in order to distinguish between loci that transcribe functional RNA, and loci where the process of transcription is important for the function of underlying regulatory elements. Perturbation experiments, including depletion and overexpression, are valuable for assessing RNA function, with genetic engineering a necessary component of experiments designed to investigate the role of genomic elements.  This selection of resources is not exhaustive, but presents a broad spectrum of available and up to date tools. Included is an example of the nomenclature preferred by each database, with all entries referring to the lncRNA HOTAIR. *HOTAIR is not found in lincPoly.
Expression of lncRNAs can be investigated, as for any other transcript, by quantitative PCR (qPCR), microarray, RNA sequencing (RNA-Seq) or fluorescence in situ hybridisation (FISH). A modification of the protocol used to create probes for FISH can be used to simultaneously generate tools for depleting lncRNA expression. 35 Commercial locked nucleic acids (LNAs) are also available for lncRNA depletion, with overexpression easily achieved using vector-based systems, similar to those used for protein-coding genes.
Several immunoprecipitation techniques exist that enable the identification of lncRNA-binding partners (examples of these techniques and references to their use are found in table 3). Given the propensity of lncRNAs to act as scaffolds, these experiments are of clear importance. When lncRNA are enhancer-derived, it is also wise to consider investigating their influence on DNA-DNA interactions. A variety of techniques exist for identifying and investigating enhancers and may well form the basis of further experiments. 36

POTENTIAL IMPACT
This success in identifying a large number of genetic associations with RA is yet to be translated into a thorough understanding of its aetiology and pathogenesis. The challenge we now face is to pinpoint the disease-associated variants and understand how they dysregulate normal biology and lead to the development of disease processes (figure 1). One of the emerging concepts in post-GWAS studies is the discovery of an All methods use biotinylated oligonucleotides as probes against the lncRNA of interest and formaldehyde cross-linking, unless otherwise stated. Figure 1 Graphical summary. (A) Genome-wide association studies single nucleotide polymorphism (GWAS SNPs) for complex diseases are mainly found outside of protein coding regions, and often overlay lncRNAs and/or enhancer elements. In this example, SNPs are found overlapping and proximal to three transcripts (curved lines); including two lncRNA, one of which is enhancer derived (red) and a mRNA (blue). (B) A detailed investigation of how these variants contribute to increased disease risk will have a profound impact on our understanding of the disease and is likely to implicate enhancers and lncRNA. lncRNAs (eg, green curved line) often function by regulating the transcription or translation of protein coding genes. One model of how this may occur is the mediation of chromatin interactions (eg, red curved line), which may involve the recruitment of chromatin modifiers, transcriptional activators or other proteins (green and orange forms, RNA pol II is represented by the purple form). enrichment of associated SNPs in intergenic regions, active in specific-cell types. We have described how bioinformatics analysis of RA loci implicates lncRNA and eRNA regions and propose that investigation of these loci may reveal novel insights into the pathogenesis of RA. Thorough investigation of how disease-associated SNPs affect lncRNA, in a cellular and stimulation-specific manner, has the huge potential to link downstream genes, pathways and cellular outcomes empirically to disease causing variants. This is likely to implicate specific enhancer elements, regulatory targets and cell types in different diseases and will have an obvious impact on the understanding on complex disease aetiology, especially where these components are shared.