Question

No change in number of reads and quality after removing rRNA contamination using sortmerna

0

Entering edit mode

6.8 years ago

Sumit Paliwal ▴ 40

Hi Everyone, I have done RNA-seq (75 bp paired-end; 50 million reads per sample) on some samples to assess differential expression. Initial FastQC report showed some sequence duplications issue. Despite that, I did the alignment and imported the resulting BAM files in Seqmonk for further analysis. RNASeq QC report in SeqMonk indicated rRNA contamination. So, I went back and removed the rRNA reads using sortmerna. In some cases, the rRNA read files are huge. However, when I do a FastQC analysis on the files without rRNA reads, I do not see a change in number of reads or sequence duplication levels compared to original raw files. I don't know how is it possible. Suggestions and solution please. Thanks.

Sumit

RNA-Seq SortmeRNA FastQC SeqMonk • 2.9k views

ADD COMMENT • link 6.8 years ago by Sumit Paliwal ▴ 40

2

Entering edit mode

I don't think rRNA contamination is a problem (unless it's a lot), but that might also depend on what you want to do with the data.

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

Entering edit mode

In my samples it ranges from 5%-25%. I think it is a significant amount.

ADD REPLY • link 6.8 years ago by Sumit Paliwal ▴ 40

1

Entering edit mode

How much rRNA contamination? 1%, 5% 10%, 50%? In general, there always is some small percentage of rRNA left, you only need to worry if is too much and decreases the number of non-rRNA reads to less than the minimum recommended for the analysis you want to perform. For example, rules of thumb for differential gene expression and differential transcript expression are 10million and 20million, respectively (I am not 100% about the 20million figure).

The other point is FastQC will almost always point there is duplication in RNAseq samples, because there will be some genes / transcripts which are very expressed.

ADD REPLY • link 6.8 years ago by h.mon 35k

0

Entering edit mode

rRNA contamination in my samples ranges from 5%-25%. With this level of contamination, I should see a difference in number of reads before and after processing files with sortmerna. My concern is that I do not see any difference, although I find that there are a significant number of reads that match with rRNA in sortmerna ouput files.

ADD REPLY • link 6.8 years ago by Sumit Paliwal ▴ 40

0

Entering edit mode

You will have to explain yourself better. What are the command lines you are using? If a sample has 25% of rRNA contamination and you use SortMeRNA on it, you should have two outputs, one with 75% of the reads, free of rRNA, and another with 25% of your reads, all rRNA.

ADD REPLY • link 6.8 years ago by h.mon 35k

0

Entering edit mode

Here is the image of the FastQC Basic Statistics prior to removal of rRNA reads: FastQC-before

SeqMonk RNASeq QC report suggesting rRNA contamination:

SeqMonk RNASeq QC

I used the following command to remove rRNA reads:

while read i
    do
        /home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/sortmerna --ref $SORTMERNA_DB --reads /media/windows/Sumit/RawMerged/${i}_merged.fastq --paired_in -a 8 --log --fastx --aligned /media/windows/Sumit/sampleoutput/${i}_rRNA --other /media/windows/Sumit/sampleoutput/${i}_sortmerna   
    done < /media/windows/Sumit/uniquelist.txt

Following is the result section from the log file generated from the above command:

Results:

Total reads = 48574678
Total reads passing E-value threshold = 18473367 (38.03%)
Total reads failing E-value threshold = 30101311 (61.97%)
Minimum read length = 76
Maximum read length = 76
Mean read length = 76

By database:

home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-bac-16s-id90.fasta 0.18%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-bac-23s-id98.fasta 0.18%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-arc-16s-id95.fasta 0.03%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-arc-23s-id98.fasta 0.01%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-euk-18s-id95.fasta 13.60%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-euk-28s-id98.fasta 23.96%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/rfam-5s-database-id98.fasta 0.01%

/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/rfam-5.8s-database-id98.fasta 0.06%

Following is the FastQC report on the file generated using SortMeRNA for the same sample: FastQC-after

As you can see that there are significant number of reads matching to Eukaryotic 18s and 28s. Still there is no change in reads in the second FastQC statistics. I hope the problem is clear now.

ADD REPLY • link 6.8 years ago by Sumit Paliwal ▴ 40

0

Entering edit mode

The problem was clear from the start, how you got to the conclusions was the blurry part.

It seems SortMeRNA is failing to split the files. Did you examine L-C-M6_rRNA-R1.fastq? Is it empty? Which version of SortMeRNA? What is the result if you run it with --paired_out?

I never used SeqMonk, but it seems it is under-reporting contamination: the graphics show less than 25% for (what I presume are) all samples, but SortMeRNA reported 38% for the sample you provided output.

I don't know what is going on, but let me suggest two tools to help you: MultiQC does aggregate reports for lots of tools, it does for FastQC, so it is easy to compare before and after results; and BBDuk is really fast at quantifying and removing contamination, being almost as sensitive as SortMeRNA but at a fraction of the time.

ADD REPLY • link 6.8 years ago by h.mon 35k