Question: Using RNA-seq, why map reads to the genome but calculate gene expression by aligning to the transcriptome?
jgbradley1 wrote (4.4 years ago, United States):

Skip to the last paragraph to go straight to the question. I am interested in calling variants from an RNA-seq sample of GM12878 and comparing them to the variants reported by GIAB. The basic outline of my pipeline is the following:

1) Trim reads with ngsShoRT.

2) Align reads with STAR to the reference genome, keeping only uniquely mapped reads.

3) Call SNPs with samtools/bcftools.

Comparing to GIAB, ~60% of the discovered variants match the GIAB calls (is that a reasonable amount with just one sample?).
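As a minimal sketch of the kind of comparison meant above (this is not the author's actual pipeline, and the variants below are made-up toy data), concordance can be computed by keying each call on (chrom, pos, ref, alt) and intersecting with the truth set:

```python
# Toy concordance check against a truth set such as GIAB.
# Variants are keyed as (chrom, pos, ref, alt) tuples; all data is hypothetical.
def concordance(called, truth):
    """Return the fraction of called variants present in the truth set."""
    if not called:
        return 0.0
    truth_set = set(truth)
    matched = sum(1 for v in called if v in truth_set)
    return matched / len(called)

called = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
          ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C"),
          ("chr3", 10, "A", "C")]
truth = [("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"),
         ("chr2", 75, "T", "C"), ("chr3", 999, "G", "T")]
print(concordance(called, truth))  # 0.6 — 3 of 5 calls match
```

A real comparison would also need to normalize representation (left-alignment, multi-allelic splitting) before intersecting, which tools like `bcftools norm` handle.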

To understand why some of the variants found by samtools/bcftools don't match GIAB, I would like to focus only on SNPs within highly expressed genes. To calculate gene expression, I want to use TPM values from Salmon. I've learned that popular quantification tools (Salmon, RSEM, etc.) are based on alignments to the transcriptome. However, only 82% of the reads aligned by STAR overlap an exon (a related question has been addressed previously). So why should I trust a quantification tool that is based on the transcriptome when 18% of the reads don't align to an exon?
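For reference, TPM itself is a simple normalization (the numbers below are made-up, not Salmon output): divide each gene's read count by its effective length to get a per-base rate, then rescale so the rates sum to one million.

```python
# Sketch of the TPM definition with hypothetical counts and effective lengths.
def tpm(counts, eff_lengths):
    rates = [c / l for c, l in zip(counts, eff_lengths)]  # reads per base
    total = sum(rates)
    return [r / total * 1e6 for r in rates]               # scale to 1M

counts = [500, 1000, 100]               # hypothetical read counts per gene
eff_lengths = [1000.0, 2000.0, 500.0]   # hypothetical effective lengths
vals = tpm(counts, eff_lengths)
print(vals)  # the first two genes get equal TPM; the values sum to 1,000,000
```

Note that the first two genes end up with identical TPM despite a two-fold count difference, because the second gene is twice as long; that length correction is the point of TPM over raw counts.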

Tags: transcriptome, rna-seq, genome
written 4.4 years ago by jgbradley1 • modified 16 months ago by Tom K

I do not know about Salmon, but you are wrong about RSEM: it can estimate read counts either from a transcriptome or from a genome plus a GTF file with gene annotations.

written 4.4 years ago by h.mon

Salmon works the same way as RSEM in this regard. For example, if you run RSEM with the genome and a GTF file, you first run rsem-prepare-reference, which extracts the transcripts from the genome, and then you align to that transcriptome. RSEM doesn't deal with alignments to the genome directly. Conversely, with a tool like STAR, you can also "project" genomic alignments (with the help of a GTF) directly into transcriptomic coordinates. In this case, RSEM, Salmon, and eXpress are all capable of processing these alignments.
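The "projection" idea can be sketched in a few lines (a deliberately simplified toy, not what STAR actually does internally): given a transcript's exon intervals from a GTF, a genomic position maps to a transcript coordinate by summing the exonic bases that precede it.

```python
# Toy projection of a genomic coordinate onto transcript coordinates.
# Exons are (start, end), 0-based half-open, forward strand; all values hypothetical.
def genome_to_transcript(pos, exons):
    """Return the transcript coordinate of a genomic position, or None if intronic."""
    offset = 0
    for start, end in exons:
        if start <= pos < end:
            return offset + (pos - start)
        offset += end - start  # accumulate exonic bases already passed
    return None  # falls in an intron or outside the transcript

exons = [(100, 200), (300, 350)]          # hypothetical two-exon transcript
print(genome_to_transcript(150, exons))   # 50: 50 bases into exon 1
print(genome_to_transcript(310, exons))   # 110: 100 exonic bases + 10 into exon 2
print(genome_to_transcript(250, exons))   # None: intronic
```

Reverse strands, spliced reads, and multi-transcript genes make the real projection considerably more involved, but the coordinate arithmetic is the same.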

written 4.4 years ago by Rob
Brian Bushnell wrote (4.4 years ago, Walnut Creek, USA):

You shouldn't.  Aligning to the genome is safer, unless you have an absolutely complete transcriptome.  Even then, differential splicing makes the transcriptome subjective.

written 4.4 years ago by Brian Bushnell

Well, it should be noted that if you use a tool like STAR to align directly to the transcriptome, you often get a better mapping rate (against the transcriptome) than if you map to the genome and project those alignments. There are a number of reasons for this, but one is that you can often recover reads that would otherwise span exon junctions in difficult and precarious ways (since those junctions are contiguous in the transcriptome). That being said, you will miss reads that map outside the reference transcriptome, or in places where the reference transcriptome is wrong.

However, one of the core motivations for tools like Salmon, RSEM, and eXpress working the way they do is that it allows a uniform interface to quantification against both reference and de novo transcriptomes. In the reference case, I can use a GTF to extract the transcripts to their own FASTA file. In the de novo case, since I'm mapping to a set of sequences that I believe to be a direct proxy for my transcriptome, a reference genome is not necessary (and, in such cases, not even available).

written 4.4 years ago by Rob

I just found this discussion about the same issue of gene quantification using the genome vs. the transcriptome. I agree with Alex Dobin's opinion when he said:

"...there are dangers in mapping the reads only to the annotated transcript sequences, since a large number of reads (~25%) usually map outside of spliced transcripts (mostly in the introns). These reads will be "forced" to map to the transcriptome possibly creating false positives...the original Cufflinks paper emphasized the importance of accounting for unannotated transcripts for accurate quantification"

I guess you see a better mapping rate when aligning to a transcriptome because novel splice patterns/junctions are ignored. But are the reads that you "recover" really supposed to align to one of the annotated transcripts, or were they forced onto a transcript even though there was a better mapping at some intronic region of the genome? I don't know whether using the transcriptome to recover reads (that map across exons in a weird way) improves the accuracy of the estimated gene expression or not.


I haven't tested this yet, but assuming you have a good genome reference, my opinion is that it would always be superior to calculate gene expression by aligning to the genome rather than to a transcriptome, because of the risk of novel splicing. The intronic reads could be a big source of error, so align the reads to the genome, throw away the intronic reads, then calculate gene expression either manually or with any tool that aligns to the transcriptome. I think aligning that filtered set of reads to a transcriptome would produce more accurate gene expression values.
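The proposed filter is essentially an interval-overlap test (sketched below with hypothetical exon coordinates; a real implementation would take them from the GTF and handle spliced reads): keep reads that overlap an annotated exon, discard purely intronic ones.

```python
# Minimal sketch of the proposed exonic filter; intervals are 0-based half-open
# and entirely hypothetical.
def overlaps_exon(read_start, read_end, exons):
    """True if the read interval overlaps any annotated exon interval."""
    return any(read_start < e_end and read_end > e_start
               for e_start, e_end in exons)

exons = [(100, 200), (400, 500)]           # annotated exons of one gene
reads = [(150, 175), (250, 300), (390, 410)]
kept = [r for r in reads if overlaps_exon(*r, exons)]
print(kept)  # [(150, 175), (390, 410)] — the (250, 300) read is intronic
```

As the reply below points out, the hard part is not this test but deciding whether a "(250, 300)" read is really intronic noise or evidence of an unannotated exon.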

written 4.4 years ago by jgbradley1

How do you know whether a read is intronic, and thus should be discarded, as opposed to coming from an undiscovered exon? Granted, if you look at the pileup in a program like IGV, it would be pretty obvious: nice, clean edges at exon/intron boundaries. But that's hard to automate; of course there are programs that do so, but I don't know of any that do it well (though I haven't used any in three years, so perhaps there are some now).

Optimally, I would map to the genome, attempt to find novel exons or noncoding genes by looking for unexpected coverage in unannotated regions, combine those with the existing annotation, then quantify by projecting the coverage onto the annotation.  Though if you are confident in the annotation, or the genome is very large and incomplete/inaccurate, examining the reads that map to unannotated features might be a waste of time.
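A toy version of the "unexpected coverage in unannotated regions" step might look like the following (per-base coverage and the annotated positions are invented for illustration): scan the coverage track and report contiguous unannotated runs above a threshold as candidate novel exons.

```python
# Flag unannotated regions whose read coverage exceeds a threshold as
# candidate novel exons. All inputs are hypothetical toy data.
def candidate_novel_regions(coverage, annotated, min_cov=5):
    """Return (start, end) half-open runs of unannotated positions with coverage >= min_cov."""
    regions, start = [], None
    for i, c in enumerate(coverage):
        hot = c >= min_cov and i not in annotated
        if hot and start is None:
            start = i                      # open a new candidate region
        elif not hot and start is not None:
            regions.append((start, i))     # close the current region
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions

coverage = [0, 0, 8, 9, 7, 0, 6, 6, 0, 0]
annotated = {6, 7}   # positions 6-7 belong to an already-annotated exon
print(candidate_novel_regions(coverage, annotated))  # [(2, 5)]
```

Real tools additionally model coverage noise and require the sharp boundary edges mentioned above, rather than a single hard threshold.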

written 4.4 years ago by Brian Bushnell

For my specific use case, I am using Ensembl annotations from BioMart for human. If you know R, it's not too difficult to read in the annotations and then filter for aligned reads that span an exon region.

written 4.4 years ago by jgbradley1
Tom K wrote (16 months ago, Japan/Yokohama/RIKEN IMS):

Would you expect all reads to map to the "transcriptome"? The reference and annotation could always be missing some genes. It is also possible that the reads include some non-coding RNAs, depending on the sequencing technology. Another possibility is that some reads do arise from a transcript but couldn't be definitively mapped to one, because they were too short or correspond to a repetitive sequence. In my experience with STAR, 82% of reads mapping to the transcriptome is pretty good. There are three ways this could be improved:

1) Higher quality sequencing data with longer reads and fewer ambiguities

2) A more comprehensive reference genome/transcriptome and annotation

3) The mapping algorithm itself (or the data pre-processing) could be excluding some reads, and could be refined or its parameters modified.

None of these can be entirely ruled out, and you cannot expect all reads to be mapped. Even in an ideal situation you will be filtering out some reads, and you should, to ensure high-quality data.

written 16 months ago by Tom K


Powered by Biostar version 2.3.0