Analysis of RNA expression is one of the most important tasks in bioinformatics. However, many things can go wrong with RNA-seq, which makes expression analysis tricky. In this tutorial we provide a detailed guide to RNA-seq mapping and explain some of the important factors you need to consider when mapping reads. You will work with an RNA-seq dataset obtained from human brain tissue and used to study changes in gene expression patterns during human aging.
RNA-seq is a next-generation sequencing method that allows us to obtain a snapshot of the RNA present in a sample and estimate its abundance. RNA-seq provides a comprehensive gene expression profile and helps to quantify and annotate genes and isoforms. The ability to quantify the level at which a particular gene is expressed in a cell, tissue or organism provides us with valuable biological information. For example, measuring gene expression can help to:
- Identify viral infection of a cell (viral protein expression);
- Determine an individual's susceptibility to cancer (oncogene expression);
- Find if a bacterial strain is resistant to penicillin (beta-lactamase expression).
Ideally, expression should be measured by detecting the final gene product (for many genes this is a protein); however, it is often technically easier to detect one of the protein precursors, typically mRNA, and infer the gene expression level from its abundance.
Several important factors make analysis of RNA-seq data complex:
- Most sequencing platforms only allow read lengths of up to 400 bp (long-read platforms such as PacBio and Oxford Nanopore are exceptions). Therefore, reads are generally too short to cover an expressed gene region entirely and are thus called ‘partial transcripts’.
- Some fraction of the sequencing reads in an RNA-seq experiment align to non-contiguous segments of the genome. Such reads are called "junction reads" — that is, reads that span the site of a splice in mRNA. Junction reads allow us to identify sites of alternative splicing, but can be complex to map and identify.
- RNA-seq data contain sources of systematic variation that should be removed before the differential expression (DE) analysis. These include between-sample differences such as library size (sequencing depth), within-sample differences such as gene length and guanine-cytosine (GC) content, and unwanted variation introduced by batch effects.
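The library-size and gene-length effects named in the last point can be made concrete with a small sketch. It computes counts per million (CPM), which corrects only for sequencing depth, and transcripts per million (TPM), which also corrects for gene length. The gene names, counts and lengths are made-up illustrative values, not taken from the tutorial dataset, and the function names are hypothetical. GC content and batch effects are not handled here.

```python
def counts_per_million(counts):
    """Scale raw read counts by library size (total reads in the sample)."""
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("sample has no counted reads")
    return {gene: c * 1e6 / library_size for gene, c in counts.items()}


def transcripts_per_million(counts, lengths_bp):
    """Correct for gene length first, then rescale so values sum to 1e6."""
    # Reads per kilobase: longer genes collect more reads at equal expression.
    rate = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(rate.values())
    if total == 0:
        raise ValueError("sample has no counted reads")
    return {gene: r * 1e6 / total for gene, r in rate.items()}


# Hypothetical toy sample: raw counts and gene lengths in base pairs.
toy_counts = {"geneA": 100, "geneB": 300, "geneC": 600}
toy_lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

cpm = counts_per_million(toy_counts)
tpm = transcripts_per_million(toy_counts, toy_lengths)
for gene in toy_counts:
    print(f"{gene}\tCPM={cpm[gene]:.0f}\tTPM={tpm[gene]:.0f}")
```

In this toy sample geneB has three times the raw count of geneA but is also three times as long, so the two genes end up with the same TPM even though their CPM values differ.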
A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. Reference-based alignment methods use the sequence of each read to find a potential mapping location, either by an exact match to the reference or by scoring sequence similarity.
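The two strategies, exact matching and similarity scoring, can be illustrated with a deliberately naive sketch. It first looks for an exact occurrence of the read in the reference and, failing that, scores every ungapped placement by its number of mismatches. The reference string, the reads, the `map_read` function and the `max_mismatches` threshold are all hypothetical. Real aligners rely on indexed references, allow gaps, and handle junction reads, none of which is attempted here.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=2):
    """Return (position, mismatches) for the best placement, or None.

    Step 1: try an exact match.
    Step 2: otherwise score every ungapped placement and keep the one with
    the fewest mismatches, provided it is within max_mismatches.
    """
    pos = reference.find(read)
    if pos != -1:
        return pos, 0

    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = count_mismatches(read, window)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


# Hypothetical toy reference and reads.
toy_reference = "ACGTTGCAAGGCTTACGATC"
exact_hit = map_read("GCAAGG", toy_reference)    # read occurs verbatim
scored_hit = map_read("GCATGG", toy_reference)   # read with one substitution
no_hit = map_read("TTTTTT", toy_reference)       # too dissimilar to place
print(exact_hit, scored_hit, no_hit)
```

Scanning every position is quadratic in practice and only workable for toy inputs, but it shows why a similarity threshold is needed: without one, every read would receive some placement, however poor.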