Question: What Can I Do With RNA-Seq Samples With Major Genomic Contamination?
Ryan Thompson (TSRI, La Jolla, CA) wrote, 8.3 years ago:

I have several lanes of paired-end Illumina RNA-Seq data in mouse, but in some lanes, less than 20% of reads map to known exons, indicating that much of the other 80% is likely contamination by genomic DNA (rather than cDNA derived from RNA). On the other end, the "best" sample has almost 80% of reads mapping to known exons.

What is a typical value for fraction of genomic contamination in an RNA-Seq dataset? Can I do anything useful with a lane of RNA-Seq where 80% of the reads aren't RNA-derived? How about 50%? 30%? 20%? I was hoping to use these to study alternative splicing, but I assume that the genomic reads would cause many false-positive cases of intron inclusion and alternative 3' and 5' splice sites. Could I still study other types of splicing events such as exon skipping and cassette exons, since these types of splicing variations would result in long insert lengths that would not be confused with genomic DNA?


how do you know it isn't rRNA?

written 8.3 years ago by Jeremy Leipzig

I didn't personally compute the statistics, but I believe that for the purposes of this calculation, "known exons" included non-protein-coding transcribed sequences such as ribosome genes.

written 8.3 years ago by Ryan Thompson

I've talked to the person who computed the statistics. Since there are many copies of the ribosomal DNA in the genome, any ribosomal reads would be filtered out because they align to too many locations.

written 8.3 years ago by Ryan Thompson

Bart Aelterman (Brussels) wrote, 8.3 years ago:

It depends on how you want to study alternative splicing, I guess. If you are looking for new splicing events, you might still be able to find new exon-exon boundaries that are covered deeply enough to be reliable. If you are looking for exon skipping, for instance, it depends on the coverage of your known exons. If your exons are not covered deeply enough (say, quite a few exons are only covered 1, 2 or 3x), it might become hard to distinguish between an exon that was skipped by alternative splicing and an exon that was simply not covered.
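To put a rough number on the "skipped or simply not covered" problem, here is a minimal sketch. It assumes reads land on an exon independently, so the number of reads overlapping it follows a Poisson distribution. That model and the function name `prob_exon_unobserved` are illustrative assumptions, not something taken from the data discussed in this thread.

```python
import math


def prob_exon_unobserved(expected_reads: float) -> float:
    """Chance that an expressed exon receives zero reads.

    Assumes (for illustration only) that the number of reads overlapping
    the exon is Poisson-distributed with mean `expected_reads`.
    """
    if expected_reads < 0:
        raise ValueError("expected_reads must be non-negative")
    # Poisson probability of observing zero events: exp(-mean)
    return math.exp(-expected_reads)


if __name__ == "__main__":
    for mean_reads in (1, 2, 3, 10):
        p = prob_exon_unobserved(mean_reads)
        print(f"mean reads = {mean_reads:>2}: P(no reads) = {p:.4f}")
```

Under this simplified model, an exon expected to collect only a handful of reads has a non-negligible chance of showing none at all, which is why missing coverage alone is weak evidence of skipping at low depth.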

Having said that, I have no idea of a typical value for genomic contamination in RNA-Seq datasets. Additionally, as Jeremy points out, that 80% of supposedly contaminating sequence could be rRNA, but it could also include new exons (which are of interest to you) or other noncoding RNA (asRNA, lncRNA; there are plenty of them).


The data is paired-end, so evidence of exon skipping would be pairs mapping to non-adjacent exons, not lack of coverage.
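As a rough illustration of that criterion, the sketch below flags a read pair as skipping evidence when its two mates fall in non-adjacent exons of the same transcript. The exon coordinates, the 0-based half-open convention, and the function names are all hypothetical, and each mate is reduced to a single mapped position for simplicity.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), 0-based, half-open; an assumed convention


def exon_index(pos: int, exons: List[Exon]) -> Optional[int]:
    """Return the index of the exon containing `pos`, or None if it is in no exon."""
    for i, (start, end) in enumerate(exons):
        if start <= pos < end:
            return i
    return None


def supports_exon_skipping(mate1_pos: int, mate2_pos: int, exons: List[Exon]) -> bool:
    """True if the two mates map to non-adjacent exons of an ordered exon list."""
    i = exon_index(mate1_pos, exons)
    j = exon_index(mate2_pos, exons)
    if i is None or j is None:
        return False  # at least one mate is intronic or intergenic
    return abs(i - j) > 1  # at least one exon lies between the two mates


if __name__ == "__main__":
    exons = [(100, 200), (300, 400), (500, 600)]  # made-up transcript
    print(supports_exon_skipping(150, 550, exons))  # mates in exons 0 and 2
    print(supports_exon_skipping(150, 350, exons))  # mates in adjacent exons
```

Such a pair is only suggestive on its own: if the unsequenced middle of the fragment is long enough to contain the intervening exon, the pair is equally consistent with inclusion, so a real analysis would also compare the implied insert size with the library's insert-size distribution.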

written 8.3 years ago by Ryan Thompson

Also, I realize that some fraction of the "non-exonic" reads are probably novel exons, which is why I don't simply discard all reads that map outside of known exons.

written 8.3 years ago by Ryan Thompson

Michael Dondrup (Bergen, Norway) wrote, 8.3 years ago:

I'll throw in my two cents on how I would deal with a situation that seems a bit strange:

  1. Unfortunately, in this situation you cannot trust anybody. Try to do the alignments yourself.
  2. Forget about the known exons for now. Transcripts could map anywhere, intronic or intergenic: large non-exonic parts of the genome have been shown to be transcribed (I think it was the ENCODE project for human). So it might as well be a normal finding.

That said, I would try different alignment programs on a subset of reads from each lane, check for adaptor contamination etc., and align against the whole genome. Do your transcripts map to the genome at all? I would think that the biologists' protocols for removing genomic DNA should be quite good if they used a kit or a standard protocol, so DNA contamination is rather unlikely unless they totally messed up. One way to tell would be to look at the rDNA genes: they should have much higher coverage than the rest, even after mRNA amplification. If they don't stick out, then it might be DNA contamination. Hope this helps.
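The rDNA check could be sketched roughly as below, assuming you already have a per-gene read density (for example reads per kilobase) from your own alignments. The function name, the input layout, and the numbers in the usage example are made up for illustration.

```python
from statistics import median
from typing import Dict, Set


def rdna_enrichment(reads_per_kb: Dict[str, float], rdna_genes: Set[str]) -> float:
    """Ratio of median read density on rDNA genes to the median on all other genes.

    A ratio far above 1 suggests the rDNA genes "stick out", as expected for
    RNA-derived reads; a ratio near 1 is what uniform genomic DNA would give.
    """
    rdna = [v for g, v in reads_per_kb.items() if g in rdna_genes]
    other = [v for g, v in reads_per_kb.items() if g not in rdna_genes]
    if not rdna or not other:
        raise ValueError("need at least one rDNA gene and one other gene")
    background = median(other)
    if background == 0:
        raise ValueError("background density is zero; ratio is undefined")
    return median(rdna) / background


if __name__ == "__main__":
    # Hypothetical densities, not taken from any real dataset
    densities = {"Rn45s": 5000.0, "GeneA": 10.0, "GeneB": 20.0, "GeneC": 30.0}
    print(rdna_enrichment(densities, {"Rn45s"}))
```

This only works if the counts keep multi-mapping reads. As noted earlier in the thread, reads from the multi-copy rDNA were filtered out for aligning to too many locations, so the alignments would need to be redone without that filter first.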
