Question

What Can I Do With RNA-Seq Samples With Major Genomic Contamination?

12

14.0 years ago

Ryan Thompson ★ 3.7k

I have several lanes of paired-end Illumina RNA-Seq data in mouse, but in some lanes, less than 20% of reads map to known exons, suggesting that much of the other 80% is contamination by genomic DNA (rather than cDNA derived from RNA). At the other extreme, the "best" sample has almost 80% of reads mapping to known exons.

What is a typical value for fraction of genomic contamination in an RNA-Seq dataset? Can I do anything useful with a lane of RNA-Seq where 80% of the reads aren't RNA-derived? How about 50%? 30%? 20%? I was hoping to use these to study alternative splicing, but I assume that the genomic reads would cause many false-positive cases of intron inclusion and alternative 3' and 5' splice sites. Could I still study other types of splicing events such as exon skipping and cassette exons, since these types of splicing variations would result in long insert lengths that would not be confused with genomic DNA?
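For reference, the "fraction of reads mapping to known exons" statistic can be sketched roughly as follows. This is a minimal illustration, not the pipeline that produced the numbers above: the function names, the half-open coordinate convention, and the toy intervals are all assumptions, and a read is counted as exonic if it overlaps any annotated exon by at least one base.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def exonic_fraction(exons_by_chrom, reads):
    """Fraction of reads overlapping at least one exon.

    exons_by_chrom: dict of chromosome -> list of (start, end), half-open.
    reads: list of (chromosome, start, end) tuples, half-open.
    """
    index = {}
    for chrom, exons in exons_by_chrom.items():
        merged = merge_intervals(exons)
        index[chrom] = (merged, [s for s, _ in merged])
    if not reads:
        return 0.0
    exonic = 0
    for chrom, start, end in reads:
        merged, starts = index.get(chrom, ([], []))
        # Last merged exon that starts at or before the read's final base.
        i = bisect_right(starts, end - 1) - 1
        if i >= 0 and merged[i][1] > start:
            exonic += 1
    return exonic / len(reads)


# Toy data (hypothetical coordinates, for illustration only).
toy_exons = {"chr1": [(100, 200), (150, 300), (500, 600)]}
toy_reads = [
    ("chr1", 90, 110),   # overlaps the first exon
    ("chr1", 300, 350),  # starts just past the merged exon, so intronic
    ("chr1", 550, 650),  # overlaps the last exon
    ("chr2", 0, 50),     # chromosome without annotated exons
    ("chr1", 0, 100),    # ends just before the first exon
]
print(f"{exonic_fraction(toy_exons, toy_reads):.0%} of reads overlap known exons")
```

Whether multi-mapping reads are kept or discarded before this count changes the result considerably, so that choice should be stated alongside any such percentage.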

rna • 9.2k views

ADD COMMENT • link updated 13.9 years ago by Michael 55k • written 14.0 years ago by Ryan Thompson ★ 3.7k

3

how do you know it isn't rRNA?

ADD REPLY • link 14.0 years ago by Jeremy Leipzig 23k

0

I didn't personally compute the statistics, but I believe that for the purposes of this calculation, "known exons" included non-protein-coding transcribed sequences such as ribosomal RNA genes.

ADD REPLY • link 14.0 years ago by Ryan Thompson ★ 3.7k

0

I've talked to the person who computed the statistics. Since there are many copies of the ribosomal DNA in the genome, any ribosomal reads would be filtered out because they align to too many locations.

ADD REPLY • link 14.0 years ago by Ryan Thompson ★ 3.7k

score 3 · Answer 1 · 2011-07-06

3

14.0 years ago

Bart Aelterman ▴ 110

I guess it depends on how you want to study alternative splicing. If you are looking for new splicing events, you might still be able to find new exon-exon boundaries that are covered deeply enough to be reliable. If you are looking for exon skipping, for instance, it depends on the coverage of your known exons. If your exons are not covered deeply enough (say, many exons are only covered 1, 2 or 3x), it may become hard to distinguish an exon that was skipped by alternative splicing from an exon that simply was not sampled.
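To put a rough number on the low-coverage problem: if read counts are assumed to follow a Poisson distribution (a simplification; real RNA-Seq counts are usually more dispersed), the chance that an expressed exon receives no reads at all is exp(-mean). The function name below is hypothetical and the sketch is only meant to show the order of magnitude.

```python
import math


def prob_zero_reads(mean_reads):
    """Poisson probability of observing zero reads when mean_reads are expected."""
    if mean_reads < 0:
        raise ValueError("mean_reads must be non-negative")
    return math.exp(-mean_reads)


# Chance that an exon is missed purely by sampling at various expected depths.
for depth in (1, 2, 3, 10):
    print(f"expected reads = {depth:2d}: P(no reads) = {prob_zero_reads(depth):.4f}")
```

Under this assumption, an exon expected to receive 1, 2 or 3 reads goes unobserved roughly 37%, 14% and 5% of the time, so absence of coverage at those depths is weak evidence of skipping.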

Having said that, I don't know what a typical value for genomic contamination in RNA-Seq datasets is. Additionally, that 80% of supposedly contaminating reads could, as Jeremy points out, be rRNA, but it could also come from new exons (which are of interest to you) or other noncoding RNA (asRNA, lncRNA; there are plenty of them).

ADD COMMENT • link 14.0 years ago by Bart Aelterman ▴ 110

1

The data is paired-end, so evidence of exon skipping would be pairs mapping to non-adjacent exons, not lack of coverage.
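A sketch of how such pairs might be sorted, under stated assumptions: exons are given as sorted, non-overlapping half-open intervals for one transcript, a pair is described only by its leftmost base and the position one past its rightmost base, and `max_fragment` is a hypothetical upper bound on library fragment length. The category names are made up for illustration. A pair whose genomic span fits within one fragment cannot be told apart from genomic DNA, and a pair on non-adjacent exons only argues for skipping if including the intervening exons would make the fragment implausibly long.

```python
def locate_exon(exons, pos):
    """Index of the exon containing pos, or None if pos is not exonic."""
    for idx, (start, end) in enumerate(exons):
        if start <= pos < end:
            return idx
    return None


def classify_pair(exons, left_start, right_end, max_fragment=500):
    """Classify a read pair by the exons its outer ends fall in.

    exons: sorted, non-overlapping half-open (start, end) intervals.
    left_start: leftmost mapped base of the pair.
    right_end: one past the rightmost mapped base of the pair.
    max_fragment: assumed maximum fragment length of the library.
    """
    i = locate_exon(exons, left_start)
    j = locate_exon(exons, right_end - 1)
    if i is None or j is None:
        return "non_exonic"
    if i == j:
        return "same_exon"
    if right_end - left_start <= max_fragment:
        # Short enough to be one contiguous piece of genomic DNA.
        return "gdna_compatible"
    if j == i + 1:
        return "spliced_adjacent"
    # Fragment length implied if every intervening exon were included.
    included = (
        (exons[i][1] - left_start)
        + sum(end - start for start, end in exons[i + 1:j])
        + (right_end - exons[j][0])
    )
    if included > max_fragment:
        return "skipping_candidate"
    return "spliced_nonadjacent"


# Hypothetical transcript with a long middle exon.
toy_transcript = [(0, 200), (1000, 1600), (2000, 2200)]
print(classify_pair(toy_transcript, 100, 2100))
```

Only the "skipping_candidate" class is hard to explain by either genomic DNA or the fully included isoform; the other spliced classes still show the RNA was processed but say nothing about which exons were kept.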

ADD REPLY • link 14.0 years ago by Ryan Thompson ★ 3.7k

1

Also, I realize that some fraction of the "non-exonic" reads are probably novel exons, which is why I don't simply discard all reads that map outside of known exons.

ADD REPLY • link 14.0 years ago by Ryan Thompson ★ 3.7k

score 2 · Answer 2 · 2011-07-06

I'll throw in my 50 cents on how I deal with a situation that seems a bit strange:

Unfortunately, in this situation you cannot trust anybody, so try to do the alignments yourself.
Forget about the known exons for now. Transcripts could map anywhere, intronic or intergenic: large non-exonic parts of the genome have been shown to be transcribed (I think it was the ENCODE project for human). So it might as well be a normal finding.

That said, I would try different alignment programs on a subset of reads from each lane, check for adaptor contamination and the like, and align against the whole genome. Do your reads map to the genome at all? I would expect the biologists' protocols for removing genomic DNA to be quite good if they used a kit or a standard protocol, so DNA contamination seems rather unlikely unless they totally messed up. One way to tell would be to look at the rDNA genes: they should have much higher coverage than the rest, even after mRNA amplification. If they don't stick out, then it might be DNA contamination. Hope this helps.
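The rDNA check could be sketched like this, assuming you already have read counts for the rDNA region and for the rest of the genome from an alignment that keeps multi-mapping reads (otherwise repeated rDNA copies may be filtered out, as noted in the comments above). The function name and the example numbers are hypothetical; no particular fold value is a validated cutoff.

```python
def coverage_fold_enrichment(region_reads, region_length,
                             background_reads, background_length):
    """Ratio of read density (reads per base) in a region versus the background."""
    if region_length <= 0 or background_length <= 0:
        raise ValueError("lengths must be positive")
    if background_reads <= 0:
        raise ValueError("background_reads must be positive")
    region_density = region_reads / region_length
    background_density = background_reads / background_length
    return region_density / background_density


# Hypothetical counts: 5,000 reads in a 10 kb rDNA region versus
# 1,000,000 reads over 2 Gb of remaining genome.
fold = coverage_fold_enrichment(5_000, 10_000, 1_000_000, 2_000_000_000)
print(f"rDNA read density is about {fold:.0f}x the genome-wide background")
```

A very large ratio points to RNA-derived reads, while a ratio close to the rDNA copy number would be more in line with genomic DNA; interpreting anything in between needs knowledge of how the library was prepared.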