RNA-seq read alignment best practices for low-input Smart-seq2 samples
I think my data may fall in a nether region between single-cell and bulk...
There are many ways to align RNA-seq reads to transcripts and genomes, and I am confused about which is best for my experiment.
My goal is to do differential expression analysis on Plasmodium vivax data. The input is fastq files of paired 25nt reads from Smart-seq2.
The P.vivax genome is 29 Mb, 40% G/C, 6,642 genes. Introns are much smaller and fewer in number than human introns. P.vivax RNA is hard to get, so we used Smart-seq2 to generate sequence. Using bowtie and RSEM we find for our richer samples we get 4.2 million reads aligned to transcripts, and 86% transcripts with at least one read aligned. For our less rich samples we get 1.2 million reads aligned to transcripts, and 78% transcripts with at least one read aligned. rRNA accounts for < 10% of the aligned reads.
- Is my understanding correct that if I am primarily interested in differential expression of known genes, as opposed to finding new genes and fixing gene models, that I may align reads to transcripts as opposed to the whole genome?
There are so many different aligners: bwa, bowtie, bowtie2, STAR, HISAT2. How much does it really matter which one I use? Does using a different alignment program have major effects on the outcome, or is it more just playing around the edges?
How do I tell which is best for my data? What metrics ought I use to assess my data?
I have read articles reviewing best practices for evaluation RNA-seq data, for example PMID26813401. But I’m not sure how this translates to my data, which I think falls in a nether region between single-cell and bulk.
It is wonderfully convenient to run RSEM and have RSEM do the alignment itself. Is it better practice to do the alignment step outside of RSEM?
Thank you! Jon