What are acceptable contaminant levels for RNA-Seq mapping?
0
0
19 months ago
Anand Rao ▴ 430

As part of my pipeline for pre-processing RNA-Seq reads prior to reference genome mapping, I have assessed contaminant levels for various sequences using FastQ_Screen (from Babraham Bioinformatics, the group that also brought us the widely used FastQC).

I have pasted below the FastQ_Screen results for the RAW READS (before any of the pre-processing steps) and for the FINAL PROCESSED READS (after all of my pre-processing steps have been completed).

Based on the 2 data tables below, could you please comment on the following:

1. Are contaminant levels in the final processed reads low enough to use them for mapping to the ref. genome?
2. Is it safe to assume that the persistent low contaminant levels for cat, dog, mouse and human will just contribute to the un-mapped category during mapping to my plant target genome, rather than result in inaccurate mapping?
3. Were contaminant levels in the raw reads low enough to begin with, indicating this library was a decent sample to start off with?
4. Do the differences between the raw reads and the fully processed reads suggest over-processing?
5. Is there any other observation that pops up that I have not even considered asking about?

Thank you!

RAW READS:

Genome / Reference  #Reads_processed    #Unmapped   %Unmapped   #One_hit_one_genome %One_hit_one_genome #Multiple_hits_one_genome   %Multiple_hits_one_genome   #One_hit_multiple_genomes   %One_hit_multiple_genomes   Multiple_hits_multiple_genomes  %Multiple_hits_multiple_genomes
adapters    14198683    14189047    99.94   18  0   1   0   1833    0.01    7784    0.05
PhiX    14198683    14198683    100 0   0   0   0   0   0   0   0
lambda  14198683    14198683    100 0   0   0   0   0   0   0   0
UniVec  14198683    14183620    99.89   27  0   38  0   1496    0.01    13502   0.1
Bacterial_masked    14198683    14101866    99.32   200 0   62423   0.44    541 0   33653   0.24
Bact_Symbiont   14198683    14175648    99.84   2   0   110 0   18  0   22905   0.16
Mitoch  14198683    14127750    99.5    0   0   0   0   68693   0.48    2240    0.02
rRNA    14198683    12192293    85.87   0   0   0   0   380549  2.68    1625841 11.45
Target_Ref_genome   14198683    277861  1.96    8511350 59.94   3272369 23.05   50369   0.35    2086734 14.7
Cat_masked  14198683    14071938    99.12   484 0   126 0   74413   0.52    51722   0.36
Dog_masked  14198683    14085865    99.21   697 0   209 0   76317   0.54    35595   0.25
Mouse_masked    14198683    13967382    98.38   450 0   155 0   90013   0.63    140683  0.99
Human_masked    14198683    14121230    99.46   377 0   75  0   48239   0.34    28762   0.2


FINAL PROCESSED READS:

Genome / Reference  #Reads_processed    #Unmapped   %Unmapped   #One_hit_one_genome %One_hit_one_genome #Multiple_hits_one_genome   %Multiple_hits_one_genome   #One_hit_multiple_genomes   %One_hit_multiple_genomes   Multiple_hits_multiple_genomes  %Multiple_hits_multiple_genomes
adapters    11269161    11269161    100 0   0   0   0   0   0   0   0
PhiX    11269161    11269161    100 0   0   0   0   0   0   0   0
lambda  11269161    11269161    100 0   0   0   0   0   0   0   0
UniVec  11269161    11268923    100 0   0   0   0   139 0   99  0
Bacterial_masked    11269161    11252080    99.85   58  0   13305   0.12    262 0   3456    0.03
Bact_Symbiont   11269161    11266803    99.98   1   0   0   0   23  0   2334    0.02
Mitoch  11269161    11230197    99.65   0   0   0   0   38149   0.34    815 0.01
rRNA    11269161    11263548    99.95   0   0   0   0   1047    0.01    4566    0.04
Target_Ref_genome   11269161    101482  0.9 7978575 70.8    3115013 27.64   23426   0.21    50665   0.45
Cat_masked  11269161    11253212    99.86   8   0   9   0   4986    0.04    10946   0.1
Dog_masked  11269161    11251633    99.85   23  0   20  0   6081    0.05    11404   0.1
Mouse_masked    11269161    11251045    99.84   20  0   11  0   5271    0.05    12814   0.11
Human_masked    11269161    11256012    99.88   14  0   3   0   4471    0.04    8661    0.08
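
For reference, the headline differences between the two tables can be worked out directly from the counts above. This is only a minimal sketch: the numbers are copied from the rRNA and Target_Ref_genome rows, while the dictionary layout and the helper names (`hits`, `pct_hit`) are made up for illustration and are not part of FastQ_Screen.

```python
# (reads_processed, unmapped) copied from the two FastQ_Screen tables above.
raw = {
    "rRNA": (14198683, 12192293),
    "Target_Ref_genome": (14198683, 277861),
}
processed = {
    "rRNA": (11269161, 11263548),
    "Target_Ref_genome": (11269161, 101482),
}


def hits(table, name):
    """Number of reads with at least one hit against the named reference."""
    total, unmapped = table[name]
    return total - unmapped


def pct_hit(table, name):
    """Percentage of reads with at least one hit against the named reference."""
    total, _ = table[name]
    return 100.0 * hits(table, name) / total


# Share of all raw reads that survived pre-processing.
retained_pct = 100.0 * processed["rRNA"][0] / raw["rRNA"][0]

# Share of target-genome-hitting reads that survived pre-processing.
target_retained_pct = (
    100.0 * hits(processed, "Target_Ref_genome") / hits(raw, "Target_Ref_genome")
)

print(f"reads retained overall:       {retained_pct:.2f}%")
print(f"target-hitting reads retained: {target_retained_pct:.2f}%")
print(f"rRNA hits, raw:               {pct_hit(raw, 'rRNA'):.2f}%")
print(f"rRNA hits, processed:         {pct_hit(processed, 'rRNA'):.2f}%")
```

Roughly four in five reads survive pre-processing, and the rRNA row accounts for a large part of what was removed, which is one way to frame the over-processing question.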

RNA-Seq decontamination • 355 views
1

Why are you doing this, if I may ask? Generally, one goes genome fishing only if the data is NOT aligning to the expected genome at a high enough rate (it will never be 100%). Since you are aligning short reads, some background level of alignment is likely to happen by chance.

0

background level of alignment is likely to happen by chance

So what is an acceptable background level for a contaminant? 1%, 0.1%, 0.001%? Especially when the reference sequences being checked have been masked for sequences found in the target genome?

1

You seem to be approaching this from a different angle than many. If I have a reasonably high fraction of reads that align to the right genome, then I generally do not worry about what got left behind.

0

I think these contamination levels are OK. If I read this correctly, most of the contaminant reads map to multiple genomes? In that case, you're dealing with < 1% contamination. Furthermore, since your target is a plant, you should be safe to just map everything, and the mammalian reads should end up unmapped. To go one step further towards safety, you could also extract the contaminant reads, map them against your plant reference, and see what happens.
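
As a rough check on the "< 1%" figure, the mammalian rows of the final processed table can be turned into hit percentages. This is a sketch only: the counts are copied from the question, and the variable names are illustrative. Because a read that hits several genomes is counted once per genome, the sum across the four references is an upper bound on the fraction of reads hitting any of them, not an exact total.

```python
# Final processed reads: total read count and #Unmapped per mammalian reference,
# copied from the second FastQ_Screen table in the question.
processed_reads = 11269161
unmapped = {
    "Cat_masked": 11253212,
    "Dog_masked": 11251633,
    "Mouse_masked": 11251045,
    "Human_masked": 11256012,
}

# Percentage of reads with at least one hit against each reference.
hit_pct = {
    name: 100.0 * (processed_reads - n_unmapped) / processed_reads
    for name, n_unmapped in unmapped.items()
}

# Reads hitting several genomes are counted once per genome, so this sum
# over-counts: it is an upper bound on the union, not the true total.
upper_bound_pct = sum(hit_pct.values())

for name, pct in hit_pct.items():
    print(f"{name}: {pct:.3f}% of processed reads")
print(f"upper bound across all four: {upper_bound_pct:.3f}%")
```

Each reference comes out well under 0.2%, and even the over-counted sum stays below 1%.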

0

most of the contaminant reads map to multiple genomes?

That is correct, most are reads that are either 'Multiple_hits_one_genome' OR 'Multiple_hits_multiple_genomes'.

My guess is that these reads are likely to contain / map to highly repetitive sequences common across both plant and animal kingdoms...

It should be easy to take the 'contaminant reads' and map them to my plant ref. genome. Thanks for that suggestion. Cheers!
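
One possible way to pull out the 'contaminant reads' before re-mapping them is to filter the FASTQ by read ID. The sketch below assumes the IDs of the reads that hit a contaminant reference have already been collected somehow (how is left open here), and that the FASTQ uses plain 4-line records. The function names and the demo reads are hypothetical, not part of FastQ_Screen or any aligner.

```python
from typing import Iterable, Iterator, Set, Tuple

FastqRecord = Tuple[str, str, str, str]


def iter_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, separator, quality) from 4-line FASTQ records.

    Assumes no line-wrapped sequences; an incomplete trailing record is dropped.
    """
    it = iter(lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), plus.rstrip("\n"), qual.rstrip("\n")


def select_reads(lines: Iterable[str], wanted_ids: Set[str]) -> Iterator[FastqRecord]:
    """Keep only records whose read ID (header up to the first space) is wanted."""
    for record in iter_fastq(lines):
        read_id = record[0][1:].split()[0]  # strip leading '@', drop the comment
        if read_id in wanted_ids:
            yield record


# Tiny in-memory demo with made-up reads; a real run would pass an open file
# handle instead of a list and write the selected records to a new FASTQ.
demo = [
    "@read1 1:N:0\n", "ACGT\n", "+\n", "IIII\n",
    "@read2 1:N:0\n", "TTGA\n", "+\n", "IIII\n",
]
selected = list(select_reads(demo, {"read2"}))
for record in selected:
    print("\n".join(record))
```

The resulting file could then be mapped to the plant reference with the same aligner and settings as the main run, to see how many of those reads also align there.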