Statistics & additional information

Database statistics

              Full set    High-confidence set
Transcripts   127,802     107,039
Genes         56,946      49,372

Genes are defined by grouping transcripts in the same orientation that share at least one partially overlapping exon.
High-confidence set transcripts do not show any coding potential by the different methods implemented in LNCipedia.

Transcript merging and filtering

First, transcripts from different sources are added to the database using custom import scripts. Two transcripts in which all exon positions are identical are merged, creating a nonredundant transcript collection. After each import, the chromosomal positions of the newly added exons are converted to either hg19 or hg38 using the liftOver tool and the corresponding chain files provided on the UCSC genome browser website (https://genome.ucsc.edu/cgi-bin/hgLiftOver). Only conversions with perfect remapping of all bases are considered. Next, the metadata (conservation, coding potential, HGNC gene symbols, …) for each transcript is either obtained or generated. In the filtering step, the following transcripts are removed from the database:

Following the filtering, all transcripts are clustered into genes based on exon overlap. These genes are then given a LNCipedia name.
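The merging step described above (collapsing transcripts whose exon positions are all identical) could be sketched as follows. This is a minimal illustration, not the actual import script: the dictionary layout, the function name `merge_identical` and the choice to include the strand in the comparison key are assumptions.

```python
from collections import OrderedDict

def merge_identical(transcripts):
    """Group transcript IDs that share exactly the same exon positions.

    Each transcript is a dict with keys 'id', 'chrom', 'strand' and
    'exons' (a list of (start, end) tuples). This layout is hypothetical.
    """
    merged = OrderedDict()
    for tx in transcripts:
        # Sorting makes the key independent of the order exons are listed in.
        # Including the strand in the key is an assumption of this sketch.
        key = (tx["chrom"], tx["strand"], tuple(sorted(tx["exons"])))
        merged.setdefault(key, []).append(tx["id"])
    return merged

transcripts = [
    {"id": "a", "chrom": "chr1", "strand": "+", "exons": [(100, 200), (300, 400)]},
    {"id": "b", "chrom": "chr1", "strand": "+", "exons": [(300, 400), (100, 200)]},
    {"id": "c", "chrom": "chr1", "strand": "+", "exons": [(100, 200), (300, 450)]},
]
groups = list(merge_identical(transcripts).values())
```

Here `a` and `b` end up in one group because their exon sets are identical, while `c` differs in one exon end and stays separate.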

Transcript clustering and naming

Different lncRNA transcripts are considered to belong to the same gene if they share at least one (partially) overlapping exon and reside on the same DNA strand. In this way, transcripts are clustered into genes.
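One way to implement this clustering rule is a union-find over transcripts combined with a sweep over sorted exons. The sketch below is illustrative only. It assumes inclusive exon coordinates, treats the overlap relation as transitive (if A overlaps B and B overlaps C, all three form one gene), and uses a hypothetical dictionary layout for transcripts.

```python
from collections import defaultdict

def cluster_transcripts(transcripts):
    """Return a list of sets of transcript IDs, one set per gene."""
    parent = {tx["id"]: tx["id"] for tx in transcripts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Collect exons per (chromosome, strand); only these can be joined.
    exons = defaultdict(list)
    for tx in transcripts:
        for start, end in tx["exons"]:
            exons[(tx["chrom"], tx["strand"])].append((start, end, tx["id"]))

    for group in exons.values():
        group.sort()
        run_end, run_tx = None, None
        for start, end, tx_id in group:
            if run_tx is not None and start <= run_end:
                # This exon overlaps an exon in the current run.
                union(tx_id, run_tx)
                run_end = max(run_end, end)
            else:
                run_end, run_tx = end, tx_id

    genes = defaultdict(set)
    for tx_id in parent:
        genes[find(tx_id)].add(tx_id)
    return list(genes.values())

example = [
    {"id": "a", "chrom": "chr1", "strand": "+", "exons": [(100, 200), (300, 400)]},
    {"id": "b", "chrom": "chr1", "strand": "+", "exons": [(350, 500)]},
    {"id": "c", "chrom": "chr1", "strand": "-", "exons": [(350, 500)]},
    {"id": "d", "chrom": "chr1", "strand": "+", "exons": [(600, 700)]},
]
clusters = sorted(sorted(gene) for gene in cluster_transcripts(example))
```

In this example `a` and `b` share an overlapping exon on the same strand and form one gene, `c` is on the opposite strand, and `d` overlaps nothing.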

If a lncRNA gene has an official gene symbol according to HGNC, that symbol is used as the primary ID (e.g. HOTAIR). Transcripts in the gene are numbered, starting with the most upstream transcript (e.g. HOTAIR:1).

If the lncRNA does not (yet) have an official gene symbol, we employ a universal lncRNA nomenclature based on the gene symbol of the nearest protein-coding gene to ease communication among researchers. These lncRNA genes are named after the HUGO symbol of the nearest protein-coding gene on the same strand using the following scheme: ‘lnc-HUGO-#’. The lncRNA genes are numbered, starting with the lncRNA gene closest to the protein-coding gene. A second number is added to denote the different transcript variants, starting with the most upstream transcript. For example, lnc-MYCN-1:1 denotes transcript 1 from gene lnc-MYCN-1.
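The naming scheme can be taken apart programmatically. The following sketch parses a transcript name into its components. The regular expressions and the function name `parse_transcript_name` are assumptions made for illustration and are not part of LNCipedia.

```python
import re

# "<gene>:<transcript number>", e.g. "HOTAIR:1" or "lnc-MYCN-1:1"
TRANSCRIPT_RE = re.compile(r"^(?P<gene>.+):(?P<transcript>\d+)$")
# "lnc-<symbol>-<gene number>", e.g. "lnc-MYCN-1"
UNNAMED_GENE_RE = re.compile(r"^lnc-(?P<symbol>.+)-(?P<number>\d+)$")

def parse_transcript_name(name):
    """Split a transcript name into gene, transcript number and,
    for 'lnc-HUGO-#' genes, the nearest-gene symbol and gene number."""
    match = TRANSCRIPT_RE.match(name)
    if match is None:
        raise ValueError(f"not a transcript name: {name!r}")
    gene = match.group("gene")
    parsed = {
        "gene": gene,
        "transcript": int(match.group("transcript")),
        "nearest_symbol": None,
        "gene_number": None,
    }
    unnamed = UNNAMED_GENE_RE.match(gene)
    if unnamed is not None:
        parsed["nearest_symbol"] = unnamed.group("symbol")
        parsed["gene_number"] = int(unnamed.group("number"))
    return parsed

mycn = parse_transcript_name("lnc-MYCN-1:1")
hotair = parse_transcript_name("HOTAIR:1")
```

The greedy match on the symbol means a hyphenated symbol keeps its hyphen and only the final number is read as the gene number.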

GRCh38/hg38 reference genome

LNCipedia now supports both the hg19 and hg38 reference genomes. Switch to your preferred reference genome by selecting it from the “Genome” link in the menu; all genomic coordinates and links to other websites and tools will be updated to the corresponding reference genome. Exports are available for both reference genomes. Note that positions are automatically converted using liftOver; transcripts that do not have a unique position in both reference genomes, or that differ in size between them, are available in only one reference genome.

UCSC trackhub

A UCSC trackhub is available at http://lncipedia.org/trackhub/hub.txt

LncRNA sources used

LncRNAdb (September 2011): 105 transcripts

LncRNAdb contains lncRNAs identified from the literature in around 60 different species.

Broad Institute (Human Body Map lincRNAs): 14,279 transcripts

The Human lincRNA Catalog collected its data from RNA-seq across 24 tissues and cell types.

Ensembl release 92: 25,075 transcripts

Ensembl gene annotation, cDNA alignments and chromatin-state map data from the Ensembl regulatory build are used to predict lincRNAs for human and mouse. The human lncRNA data is imported into LNCipedia.

Gencode 13: 19,812 transcripts

The main data set combines the HAVANA manual annotation using evidence from various sources and research groups with the Ensembl automatic annotation pipelines to achieve an accurate and complete annotation of the human genome.

RefSeq - Dec 2014: 4,774 transcripts
RefSeq - NCBI Annotation Release 106: 5,487 transcripts

Each RefSeq (Reference Sequence) is constructed wholly from sequence data submitted to the International Nucleotide Sequence Database Collaboration. Only entries with the property “biomol_ncrna_lncrna” were considered.

Nielsen et al: 7,656 transcripts

Expression levels are evaluated across 12 human tissues (bladder, brain, breast, colon, heart, kidney, liver, lung, muscle, ovary, prostate and skin) using a custom-designed microarray, supplemented with RNA-seq.
Various filters were applied:

  1. All probes were aligned to all protein-coding mRNAs using BLAST, and probes with E-values below 1e−10 failed.
  2. Probes overlapping a genomic region with more than 10 human chained self-alignments (Kent et al. 2003).
  3. Probes overlapping regions with mitochondrial homology.
  4. Probes overlapping repeatMask regions.
The following three filter rules were subsequently applied to all nc transcripts:
Nc transcripts with any probe failing filter 1 were discarded.
Nc transcripts with no probes passing filters 2, 3, and 4 were discarded.
Nc transcripts overlapping pseudogenes defined by GENCODE (V12) were discarded.
Collectively, this reduced the number of analyzed transcripts from 26,910 to 12,115.
After filtering the data for lncRNAs, we added 5,339 transcripts to the database.
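The three transcript-level rules above can be expressed as a single predicate. This is a sketch under stated assumptions: the per-probe flags (`fails_1` to `fails_4`) and the function name are hypothetical, and the second rule is read as "at least one probe must pass filters 2, 3 and 4 together".

```python
def keep_nc_transcript(probes, overlaps_pseudogene):
    """Return True if a non-coding transcript survives all three rules.

    probes: list of dicts with boolean keys 'fails_1' .. 'fails_4',
            one dict per probe on the transcript (hypothetical layout).
    overlaps_pseudogene: True if the transcript overlaps a GENCODE pseudogene.
    """
    # Rule 1: any probe failing filter 1 discards the transcript.
    if any(p["fails_1"] for p in probes):
        return False
    # Rule 2: at least one probe must pass filters 2, 3 and 4.
    # Reading "passing" as "passing all three" is an assumption.
    if not any(not (p["fails_2"] or p["fails_3"] or p["fails_4"]) for p in probes):
        return False
    # Rule 3: transcripts overlapping pseudogenes are discarded.
    if overlaps_pseudogene:
        return False
    return True

clean = {"fails_1": False, "fails_2": False, "fails_3": False, "fails_4": False}
repeat = {"fails_1": False, "fails_2": False, "fails_3": False, "fails_4": True}
coding_hit = {"fails_1": True, "fails_2": False, "fails_3": False, "fails_4": False}
```

A transcript with one clean probe and one repeat-overlapping probe is kept, whereas a single probe matching a protein-coding mRNA is enough to discard it.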

Hangauer et al: 5,339 transcripts

The authors collected RNA-seq data and performed de novo transcriptome assembly on each of the RNA-seq datasets, generating 6,833,809 de novo assembled transcripts. The transcripts were filtered, and only long non-coding RNAs were added to the database.
Filter: fragments per kilobase of transcript per million mapped reads (FPKM) > 1
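For reference, FPKM follows directly from its name: the fragment count is divided by the transcript length in kilobases and by the number of mapped fragments in millions. The sketch below applies the FPKM > 1 filter using that standard definition. The function names and example numbers are illustrative and are not taken from the publication.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # (fragments / (length_bp / 1e3)) / (total_mapped_fragments / 1e6)
    return fragments * 1e9 / (length_bp * total_mapped_fragments)

def passes_expression_filter(fragments, length_bp, total_mapped_fragments, cutoff=1.0):
    """Apply the FPKM > cutoff filter (strictly greater than)."""
    return fpkm(fragments, length_bp, total_mapped_fragments) > cutoff

# 500 fragments on a 2,000 bp transcript in a library of 25 million fragments
value = fpkm(500, 2000, 25_000_000)
```

With these example numbers the transcript has an FPKM of 10 and passes the filter.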

NONCODE: 93,164 transcripts

NONCODE data is collected from three sources:
1. Literature mining,
2. Specialized databases,
3. GenBank

Sun and Gadad et al., 2015: 2,305 transcripts

Abstract: We describe a computational approach that integrates GRO-seq and RNA-seq data to annotate long noncoding RNAs (lncRNAs), with increased sensitivity for low-abundance lncRNAs. We used this approach to characterize the lncRNA transcriptome in MCF-7 human breast cancer cells, including >700 previously unannotated lncRNAs. We then used information about the (1) transcription of lncRNA genes from GRO-seq, (2) steady-state levels of lncRNA transcripts in cell lines and patient samples from RNA-seq, and (3) histone modifications and factor binding at lncRNA gene promoters from ChIP-seq to explore lncRNA gene structure and regulation, as well as lncRNA transcript stability, regulation, and function. Functional analysis of selected lncRNAs with altered expression in breast cancers revealed roles in cell proliferation, regulation of an E2F-dependent cell-cycle gene expression program, and estrogen-dependent mitogenic growth. Collectively, our studies demonstrate the use of an integrated genomic and molecular approach to identify and characterize growth-regulating lncRNAs in cancers.

FANTOM CAT: 27,719 transcripts

Abstract: Here, using FANTOM5 cap analysis of gene expression (CAGE) data, we integrate multiple transcript collections to generate a comprehensive atlas of 27,919 human lncRNA genes with high-confidence 5′ ends and expression profiles across 1,829 samples from the major human primary cell types and tissues. Genomic and epigenomic classifications of these lncRNAs reveals that most intergenic lncRNAs originate from enhancers rather than from promoters. Incorporating genetic and expression data, we show that lncRNAs overlapping trait-associated single nucleotide polymorphisms are specifically expressed in cell types relevant to the traits, implicating these lncRNAs in multiple diseases. We further demonstrate that lncRNAs overlapping expression quantitative trait loci (eQTL)-associated single nucleotide polymorphisms of messenger RNAs are co-expressed with the corresponding messenger RNAs, suggesting their potential roles in transcriptional regulation. Combining these findings with conservation data, we identify 19,175 potentially functional lncRNAs in the human genome.

http://fantom.gsc.riken.jp/cat/

The stringent set of FANTOM CAT lncRNAs is included in LNCipedia with the exclusion of 34 transcripts that were in conflict with the HUGO gene boundaries.

Protein coding potential

Protein coding potential is assessed by means of several prediction algorithms, a novel PRIDE database search and published ribosome-profiling data.

CPC: Coding Potential Calculator

From the CPC website:
We developed a Support Vector Machine-based classifier, named Coding Potential Calculator (CPC), to assess the protein-coding potential of a transcript based on six biologically meaningful sequence features. 10-fold cross-validation on the training dataset and independent testing on three large standalone datasets showed that CPC can discriminate coding from noncoding transcripts with high accuracy.

HMMER: Biosequence analysis using profile hidden Markov models

We used the hmmscan algorithm against the pfam2 database to search for known protein domains in all 6 reading frames of the transcript. The number of Pfam domains found is reported for both the 5' to 3' and 3' to 5' direction.
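Because hmmscan compares protein sequences against profile HMMs, a nucleotide transcript first has to be translated in each of its six reading frames (three per strand). The sketch below shows one way to produce those six translations with the standard genetic code. It is an illustration of the concept, not the code used by LNCipedia, and the frame labels (`+1` to `-3`) are an assumed convention.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frame_translation(seq):
    """Return a dict mapping frame labels '+1'..'+3', '-1'..'-3' to proteins."""
    seq = seq.upper().replace("U", "T")
    frames = {}
    for label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames

frames = six_frame_translation("ATGGCCTAA")
```

The three `+` frames correspond to the 5' to 3' direction of the transcript and the three `-` frames to its reverse complement.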

PRIDE: database search

We have re-analysed more than 100 Homo sapiens proteomics projects from the PRIDE database by searching MS/MS spectra against the standard UniProtKB/Swiss-Prot human database together with the translated version of LNCipedia.

PhyloCSF: Coding Potential of a multi-species nucleotide sequence alignment

We use the PhyloCSF algorithm to benchmark the (non)coding Ensembl data. We achieved a specificity and sensitivity of 93% at a cutoff of 60.7876. A score below this cutoff means that the transcript is non-coding; a score above it means that the transcript is likely to be coding.

CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model

We use the CPAT algorithm to calculate the coding probability based on the sequence of the lncRNA. The suggested coding probability cutoff of 0.364 is used; this cutoff corresponds to a sensitivity and specificity of 0.966.
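Applying the PhyloCSF and CPAT cutoffs quoted above amounts to a simple threshold comparison. The sketch below shows this with the two cutoff values from the text. The function names are hypothetical, and the handling of a score exactly equal to the cutoff (treated as coding here) is an assumption, since the text does not specify it.

```python
PHYLOCSF_CUTOFF = 60.7876  # from the text above
CPAT_CUTOFF = 0.364        # from the text above

def phylocsf_call(score, cutoff=PHYLOCSF_CUTOFF):
    """Scores below the cutoff are called non-coding."""
    return "non-coding" if score < cutoff else "coding"

def cpat_call(probability, cutoff=CPAT_CUTOFF):
    """Coding probabilities below the cutoff are called non-coding."""
    return "non-coding" if probability < cutoff else "coding"

def no_coding_potential(phylocsf_score, cpat_probability):
    """True only if both tools call the transcript non-coding."""
    return (
        phylocsf_call(phylocsf_score) == "non-coding"
        and cpat_call(cpat_probability) == "non-coding"
    )
```

Combining the two calls with a logical AND mirrors the idea of a high-confidence set in which no method reports coding potential, although the real set also takes the other methods on this page into account.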

Ribosome-profiling: Lee et al., 2012 and Bazzini et al., 2014

253 lncRNAs containing small open reading frames (smORFs) are provided by Bazzini et al., 2014. Bazzini and colleagues developed an approach to detect smORFs using ribosome profiling whereby the periodicity of ribosome movement on actively translated ORFs is used to distinguish coding from non-coding sequences.

A second approach to apply ribosome profiling in the quest for novel coding RNAs has been described by Lee et al., 2012. Using lactimidomycin, a ribosome inhibitor specific to initiating ribosomes, translation initiation sites (TIS) were mapped in HEK-293 cells.

Conservation

Locus conservation

Locus conservation is assessed by evaluating the positional conservation and order of the flanking protein-coding genes. A human lncRNA locus is considered conserved when the flanking protein-coding genes have flanking orthologues in another species, as assessed by the Ensembl Compara API. Currently, locus conservation in mouse and zebrafish compared to human is provided. Our analyses suggest human locus conservation of 60% compared to mouse and 25% compared to zebrafish.

Transcript classification

Transcripts are classified based on their position relative to protein-coding genes (Ensembl 84). The criteria are evaluated in the following order:

  1. Overlap with protein coding gene on the same strand:
    1. No overlap with protein-coding exons: intronic
    2. Otherwise: sense overlapping
  2. Overlap with protein coding gene on the opposite strand: antisense
  3. No overlap with any protein coding gene
    1. Transcription start site of protein coding gene on the other strand is within 1000 bp of lncRNA transcription start site: bidirectional
    2. Otherwise: intergenic
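The decision order above maps onto a short chain of conditionals. The sketch below assumes that the overlap tests and the distance between transcription start sites have already been computed. The argument names are hypothetical, and "within 1000 bp" is read as a distance of at most 1000.

```python
def classify_transcript(
    overlaps_same_strand_gene,
    overlaps_coding_exon,
    overlaps_opposite_strand_gene,
    opposite_strand_tss_distance=None,
):
    """Classify a lncRNA transcript relative to protein-coding genes.

    opposite_strand_tss_distance: distance in bp between the lncRNA TSS and
    the nearest protein-coding TSS on the other strand, or None if unknown.
    """
    # 1. Overlap with a protein-coding gene on the same strand
    if overlaps_same_strand_gene:
        return "sense overlapping" if overlaps_coding_exon else "intronic"
    # 2. Overlap with a protein-coding gene on the opposite strand
    if overlaps_opposite_strand_gene:
        return "antisense"
    # 3. No overlap with any protein-coding gene
    if opposite_strand_tss_distance is not None and opposite_strand_tss_distance <= 1000:
        return "bidirectional"
    return "intergenic"
```

Because the checks run in order, a transcript that overlaps protein-coding genes on both strands is classified by the same-strand rule first.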