Hi all, I am new to bioinformatics and quite a beginner in programming. Can anyone suggest some sources where I can learn at least 50% of scRNA-seq data analysis? I am familiar with the C language, and I know a little bit of molecular biology too.
The Hemberg Lab has a very useful introduction to scRNA-seq analysis using R.
In terms of R packages, and actually understanding why the objects are becoming fairly complex, the explanations from the Bioconductor people are fairly insightful, and the principles are applicable to Seurat too (although the names of the accessor functions for, say, retrieving the matrix of read counts will be different). The accompanying book is here.
Generally, there are these steps that the analysis will involve:
- Read alignment (FASTQ --> BAM): the tooling depends a bit on the type of data you have. For data from the 10X Genomics platform, 10X offers its CellRanger software, but there are other tools such as alevin and STARsolo. This step is done for virtually all NGS data, but it is slightly more complicated for single-cell data because the tools need to keep track of where each read came from (which cell and, if UMIs were used, which transcript).
- Count matrix generation: The first major goal is to obtain a matrix of read counts per gene, where rows usually correspond to genes and columns to cells. For single-cell RNA-seq, this is usually part of the alignment step.
- Filtering, normalization, batch correction, ...: this is where scRNA-seq becomes really frustrating, even for experienced bioinformaticians, because there's no real consensus yet as to how scRNA-seq data should be normalized. This is why many people will point you to Seurat, which pretends it has it all figured out by providing aptly named functions such as ScaleData. If your data looks similar to what people have been working with, the default settings may work.
- Dimensionality reduction: tSNE, PCA, UMAP, ... These techniques let you represent the data in xy-coordinates (= 2 dimensions) rather than in the original number of dimensions of your count matrix (probably something like 30 000 genes x 10 000 cells).
- Clustering cells: usually done with graph-based methods because they seem to offer the best compromise between speed and accuracy for single-cell data. There are a couple of excellent reviews on the topic (Menon 2018, Kiselev 2019) and some benchmarking papers (Duo 2019, Freytag 2019).
- Assigning labels to cells: this is the main goal of many scRNA-seq studies these days, and it's usually quite tricky. In principle, we expect to see certain genes that are expressed only in certain clusters of cells (marker genes), and based on those we try to infer the "cell type". While not very technical in nature, I found the discussions by Jesse Gillis and Megan Crow (here and here) quite insightful.
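To make the first few steps above a bit more concrete, here is a toy sketch in plain Python (no packages needed). It is not what CellRanger, alevin, STARsolo or Seurat actually do internally; the reads, barcodes, gene names and the two helper functions are all made up for illustration. It only shows the idea: collapse reads that share the same cell barcode, UMI and gene, build a genes x cells count matrix, then scale each cell to a common total and log-transform, which is one common (but by no means the only) normalization approach.

```python
import math
from collections import defaultdict

# Hypothetical aligned reads as (cell barcode, UMI, gene) triples.
# In a real analysis these come out of the alignment step, not a list literal.
reads = [
    ("AAAC", "u1", "GeneA"),
    ("AAAC", "u1", "GeneA"),  # PCR duplicate: same cell, UMI and gene
    ("AAAC", "u2", "GeneA"),
    ("AAAC", "u3", "GeneB"),
    ("TTTG", "u1", "GeneB"),
    ("TTTG", "u4", "GeneB"),
    ("TTTG", "u5", "GeneC"),
    ("TTTG", "u5", "GeneC"),  # PCR duplicate
]

def count_matrix(reads):
    """Collapse duplicate UMIs and return (genes, cells, matrix).

    matrix[i][j] is the number of unique molecules of genes[i] in cells[j],
    i.e. rows are genes and columns are cells.
    """
    unique = set(reads)  # one entry per (cell, UMI, gene) molecule
    genes = sorted({gene for _, _, gene in unique})
    cells = sorted({cell for cell, _, _ in unique})
    counts = defaultdict(int)
    for cell, _, gene in unique:
        counts[(gene, cell)] += 1
    matrix = [[counts[(g, c)] for c in cells] for g in genes]
    return genes, cells, matrix

def log_normalize(matrix, scale=10_000):
    """Scale every cell (column) to `scale` total counts, then apply log1p."""
    n_cells = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(n_cells)]
    return [
        [math.log1p(row[j] / totals[j] * scale) if totals[j] else 0.0
         for j in range(n_cells)]
        for row in matrix
    ]

genes, cells, counts = count_matrix(reads)
normalized = log_normalize(counts)

# Print the raw count matrix, one gene per row.
print("gene", *cells, sep="\t")
for gene, row in zip(genes, counts):
    print(gene, *row, sep="\t")
```

Real count matrices are huge and mostly zeros, so actual tools store them as sparse matrices rather than nested lists, and they also correct for sequencing errors in barcodes and UMIs, which this sketch ignores.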
In short, as others have pointed out, scRNA-seq is really not an ideal starting point for a new bioinformatician because it's a fairly new data type and we're still grappling with all its intricacies and caveats. That being said, you may find more automated solutions like the one provided by the EPFL (asap) useful for playing around with some data; just be cautious about making bold interpretations and claims.
Single-cell data are rather unpleasant as a beginner's topic due to their noisy and sparse nature. It may be better to first analyze some bulk RNA-seq data to get familiar with R (see here), and then dive into the documentation of Seurat, which is the jack-of-all-trades of scRNA-seq analysis. For low-level processing, alevin is a good choice.
Yes, Seurat would be one of the starting points. Besides the tutorials offered on the Seurat website, a while ago I posted some R code on the Seurat GitHub page: https://github.com/satijalab/seurat/issues/1193 (hope it is helpful)