Baseline expression

Baseline RNA and protein expression data help us to ascertain whether a target is expressed in all tissues (e.g. housekeeping genes) or only in one or a few tissues (cells or organs). This information is relevant at several stages of drug target identification and prioritisation, including target safety and toxicological assessments.

We combine baseline expression information from three sources:

On this page, you will find details on how the RNA expression data shown in the Summary view of the target profile page are processed.

Summary of baseline expression data at both RNA and protein levels for target MB (myoglobin).

RNA expression meta-analysis

The Expression Atlas team performs a dedicated analysis of eight RNA expression studies, five tissue-based and three cell-type-based (the latter all from the BLUEPRINT project):

  • RNA-seq of 53 human tissue samples from GTEx (E-MTAB-5214)

  • RNA-seq of 16 human tissues from the Illumina Body Map project (E-MTAB-513)

  • RNA-seq of 13 human tissues from the ENCODE project (Snyder Lab) (E-MTAB-4344)

  • RNA-seq of 6 human tissues from Kaessmann Lab (E-MTAB-3716)

  • mRNA-seq of 32 human tissues from Human Protein Atlas (E-MTAB-2836)

  • mRNA-seq of rare types of cells of different haemopoietic lineages from healthy individuals in the BLUEPRINT project (E-MTAB-3819)

  • RNA-seq of common types of cells of different haemopoietic lineages from healthy individuals in the BLUEPRINT project (E-MTAB-3827)

  • mRNA-seq of plasma cells of tonsil from healthy individuals from the BLUEPRINT project (E-MTAB-4754)

In total, these studies comprise more than 18,000 samples across more than 50 tissues and more than 30 cell types.

The tissue- and cell-based samples are processed separately to avoid batch effects during normalisation. The samples of each group are processed together to generate an expression table of normalised Transcripts Per Million (TPMs) for each gene in each tissue or cell type as follows:

  • Aggregation of technical replicates

  • Filtering of lowly expressed genes: a gene is kept only if it has at least 10 raw reads in at least 15 samples

  • Samples were normalised using the Remove Unwanted Variation (RUV) method (Risso et al. 2014), applied as a two-step process:

    • The Coefficient of Variation (CV) was estimated for each gene across all the samples and used to select the least variable genes.

    • The 1,000 least variable genes were used as negative controls (that is, genes assumed not to be differentially expressed) to train RUVg to remove unwanted variation

  • Tissues were mapped to UBERON (the Uber-anatomy ontology); those that did not match were discarded

  • Samples from the same tissue across different experiments were combined by taking the median, giving one value per tissue in the final matrix

  • Finally, the expression tables of the tissue- and cell-based experiments were merged
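Some of the steps above can be illustrated with a minimal sketch. It is not the Expression Atlas pipeline: the function names, the data layout (plain dictionaries) and the toy numbers are assumptions made for illustration, and the RUVg normalisation itself (available in the RUVSeq R package) is not reproduced. The sketch covers only the low-expression filter, the selection of negative-control genes by coefficient of variation, and the median merge across experiments.

```python
from statistics import mean, median, pstdev


def filter_low_expression(counts, min_reads=10, min_samples=15):
    """Keep genes with at least `min_reads` raw reads in at least `min_samples` samples.

    `counts` maps gene -> list of raw read counts, one per sample
    (a hypothetical layout chosen for this sketch).
    """
    return {
        gene: values
        for gene, values in counts.items()
        if sum(v >= min_reads for v in values) >= min_samples
    }


def least_variable_genes(expression, n_controls=1000):
    """Return the `n_controls` genes with the lowest coefficient of variation.

    CV is computed as standard deviation / mean across all samples.
    Genes with a mean of zero are skipped because their CV is undefined.
    """
    cv = {}
    for gene, values in expression.items():
        mu = mean(values)
        if mu > 0:
            cv[gene] = pstdev(values) / mu
    return sorted(cv, key=cv.get)[:n_controls]


def merge_by_median(per_experiment):
    """Combine one gene's values across experiments: tissue -> median value.

    `per_experiment` is a list of dicts, each mapping tissue -> expression.
    """
    by_tissue = {}
    for table in per_experiment:
        for tissue, value in table.items():
            by_tissue.setdefault(tissue, []).append(value)
    return {tissue: median(values) for tissue, values in by_tissue.items()}


# Toy data (invented): GENE_A passes the filter, GENE_B has only 14 samples >= 10 reads.
counts = {"GENE_A": [12] * 15 + [0] * 5, "GENE_B": [12] * 14 + [0] * 6}
kept = filter_low_expression(counts)

# Toy data (invented): pick the 2 least variable genes out of 3.
expression = {"FLAT": [10, 10, 10, 10], "NOISY": [1, 20, 1, 20], "MID": [9, 11, 9, 11]}
controls = least_variable_genes(expression, n_controls=2)

# Toy data (invented): the same gene measured in three experiments.
merged = merge_by_median([
    {"liver": 4.0, "lung": 1.0},
    {"liver": 6.0},
    {"liver": 20.0, "lung": 3.0},
])
```

Using the median rather than the mean in the last step means a single experiment with an unusually high value (20.0 for liver in the toy data) does not dominate the merged value.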

We analyse this expression file further to compute two values for each gene:

  • Binned value of expression: The normalised expression values are divided into 10 bins of equal width. Note that these bins are not deciles: deciles each contain the same number of values, whereas equal-width bins each span the same range of expression and can contain very different numbers of values

  • Tissue specificity: Z-scores are calculated for each gene and each tissue and are then binned based on quantiles of a perfect normal distribution. This allows us to extract the tissues for which a gene is specific, defined as the expression value being above the 75th z-score percentile; in practice, anything in bin 2 or above (more information in the FAQ section)
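The two values can be sketched as follows. This is an illustration under stated assumptions rather than the platform's actual code: the function names are hypothetical, bins are numbered from 0 here (the numbering used by the platform is not specified above), z-scores are assumed to be computed per gene across tissues with the population standard deviation, and "above the 75th z-score percentile" is interpreted as a z-score above the 0.75 quantile of the standard normal distribution. The exact bin edges used for the z-score bins are not reproduced, and the expression values are invented.

```python
from statistics import NormalDist, mean, pstdev


def equal_width_bin(value, lowest, highest, n_bins=10):
    """Assign `value` to one of `n_bins` equal-width bins spanning [lowest, highest].

    Bins are numbered 0 .. n_bins - 1 (an assumption of this sketch).
    """
    if highest == lowest:
        return 0
    width = (highest - lowest) / n_bins
    index = int((value - lowest) / width)
    # The maximum value falls exactly on the upper edge; keep it in the last bin.
    return min(index, n_bins - 1)


def z_scores(expression_by_tissue):
    """Z-score of one gene's expression in each tissue, relative to all its tissues."""
    values = list(expression_by_tissue.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return {tissue: 0.0 for tissue in expression_by_tissue}
    return {tissue: (v - mu) / sd for tissue, v in expression_by_tissue.items()}


def specific_tissues(expression_by_tissue, percentile=0.75):
    """Tissues whose z-score exceeds the given quantile of a standard normal."""
    cutoff = NormalDist().inv_cdf(percentile)  # roughly 0.674 for the 75th percentile
    scores = z_scores(expression_by_tissue)
    return sorted(tissue for tissue, z in scores.items() if z > cutoff)


# Equal-width bins vs. deciles: one large value stretches the range,
# so the four small values all share the lowest bin.
values = [1, 2, 3, 4, 100]
bins = [equal_width_bin(v, min(values), max(values)) for v in values]

# Invented expression profile for a muscle-specific gene.
profile = {
    "skeletal muscle": 900, "heart": 700,
    "liver": 5, "lung": 4, "brain": 3, "kidney": 6,
}
specific = specific_tissues(profile)
```

With the skewed toy values, four of the five values land in bin 0 and one in bin 9, whereas deciles would spread them evenly; this is the distinction drawn in the first bullet.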