Hi!
I have received a series of human poly-A RNA-seq samples (single-end, 75 bp) that display suspicious mapping statistics. These samples were mapped with STAR, and roughly 30-50% of the reads are reported as "unmapped: reads too short". Previous samples prepared with the same method had only 5-10% in that category.
Despite the sharp drop in uniquely mapping reads, the sequencing itself seems to have worked well (many genes detected, reads mapping to exons, splicing visible, ...).
After careful inspection of the reads, I am starting to suspect bacterial contamination because:
- Many of the reads I ran through BLAST are a perfect match to E. coli or other prokaryotes.
- These are not ribosomal reads (evaluated with BBDuk).
- They do not appear to contain the primers / adapter sequences used in the library preparation.
- If I map these reads to a hybrid E. coli 16S - hg38 genome, I get 10-100 times more reads mapping to the E. coli sequence in these new samples than in the old ones.
I would like to estimate the proportion of reads coming from prokaryotes (E. coli?) in these samples. As I am not familiar with the metagenomics field, I was wondering if someone could recommend a procedure to do so.
I am also open to other suggestions regarding the possible issues with these samples.
Thank you in advance!
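Try FastQ Screen. Index the E. coli genome and add it to the configuration file. FastQ Screen prints out the contamination levels as the percentage of reads mapping to each genome. By default it screens only a subset of the reads in each file, so please increase the number of reads to be analyzed.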
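To make the steps above more concrete, here is a rough Python sketch that writes a FastQ Screen configuration and assembles the two commands involved. It assumes Bowtie2 as the aligner and uses `bowtie2-build` for indexing. All file names and paths (`ecoli.fa`, `/ref/hg38`, `sample.fastq.gz`, ...) and the helper function names are placeholders I made up, so adapt them to your setup and check the options against the FastQ Screen version you have installed.

```python
import shlex


def index_command(fasta, index_basename):
    """Command that builds a Bowtie2 index for one reference genome."""
    return ["bowtie2-build", fasta, index_basename]


def make_config(databases, threads=4):
    """Return the text of a FastQ Screen configuration file.

    databases: list of (name, bowtie2_index_basename) pairs.
    The paths are hypothetical placeholders in this sketch.
    """
    lines = [f"THREADS\t{threads}"]
    for name, index_basename in databases:
        lines.append(f"DATABASE\t{name}\t{index_basename}")
    return "\n".join(lines) + "\n"


def screen_command(fastq, conf_path, subset=1_000_000, outdir="fastq_screen_out"):
    """Command that screens one FASTQ file against the configured genomes.

    subset is raised above the tool's small default so that the
    reported percentages rest on more reads.
    """
    return [
        "fastq_screen",
        "--aligner", "bowtie2",
        "--conf", conf_path,
        "--subset", str(subset),
        "--outdir", outdir,
        fastq,
    ]


if __name__ == "__main__":
    # Placeholder genomes: human plus the suspected contaminant.
    config_text = make_config([("Human", "/ref/hg38"), ("Ecoli", "/ref/ecoli")])
    with open("fastq_screen.conf", "w") as handle:
        handle.write(config_text)

    # Print the commands instead of running them, so they can be inspected first.
    print(shlex.join(index_command("ecoli.fa", "/ref/ecoli")))
    print(shlex.join(screen_command("sample.fastq.gz", "fastq_screen.conf")))
```

Running the printed commands for both the old and the new samples should give you comparable per-genome percentages, which is what you need to see whether the E. coli fraction really went up.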