I am working on a set of microRNA RNA-seq data. While checking data quality with FastQC, we noticed a strange problem: in every sample, a large portion of the reads (roughly 40% to 60%, or around 2-4 million reads per sample) are duplicates of a single read. FastQC tags this sequence as a possible PCR primer. We tried to BLAST this sequence against miRBase (after removing the adapter), but couldn't find a matching microRNA. My colleagues suggest that this could be biological, but I am not convinced. So my questions are: assuming that FastQC's tagging of this read as a PCR primer is a false positive, could one microRNA be dominant in all the sequenced samples? And how can we confirm whether it is biological or a problem that arose during sequencing?

We contacted the folks who sequenced our samples (done externally) about the problem I mentioned. After some checking (I don't know the details yet), they informed us that it was an error in the library preparation/sequencing step and agreed to re-sequence our samples. So, thank you all for taking an interest.

Also, my own 2 cents: a life scientist is usually like Fox Mulder from The X-Files, whose motto was "I want to believe". As a bioinformatician, I feel I am Dana Scully, who is always skeptical.

I just have to see this read.

That, right here: make a new question, put your read there, and here is a title for it: "All my data looks alike. Help me decide: is it a new insight or just a bad run?"

As a first step, I would suggest also performing a BLAST search against the NCBI nucleotide database in order to identify any other potential source of the sequence. I think there are several possible sources of "contamination" during the preparation of a small RNA-Seq library (I have personally seen fragments of rRNA that were amplified during the first PCR amplification of small RNAs that had been size-fractionated prior to amplification). The fact that this one sequence is so highly abundant suggests that it is a PCR artifact.

A further question is: does FastQC identify the primer sequence? It should, as it uses a list of oligos as a reference, e.g. to name the different Illumina adaptors and primers.
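To get the dominant read out of the trimmed data and into a form that can be pasted into BLAST, something along these lines might help. It is only a minimal sketch: it assumes plain 4-line FASTQ records (no wrapped sequences), and the function names `top_sequences` and `to_fasta`, as well as the demo reads, are made up for illustration and are not real data from this thread.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_sequences(fastq_lines: Iterable[str], n: int = 5) -> List[Tuple[str, int, float]]:
    """Return the n most frequent read sequences as (sequence, count, fraction)."""
    counts: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        # Assumption: strict 4-line records, so the sequence is line 2 of each record.
        if i % 4 == 1:
            counts[line.strip()] += 1
    total = sum(counts.values())
    return [(seq, c, c / total) for seq, c in counts.most_common(n)]


def to_fasta(records: List[Tuple[str, int, float]], prefix: str = "overrepresented") -> str:
    """Format the tallied sequences as FASTA text, ready to paste into a BLAST form."""
    entries = []
    for rank, (seq, count, fraction) in enumerate(records, start=1):
        header = f">{prefix}_{rank} count={count} fraction={fraction:.3f}"
        entries.append(f"{header}\n{seq}")
    return "\n".join(entries)


# Invented demo records (hypothetical sequences, for illustration only).
demo_fastq = [
    "@read1", "ACGTACGTACGTACGTACGTAC", "+", "I" * 22,
    "@read2", "ACGTACGTACGTACGTACGTAC", "+", "I" * 22,
    "@read3", "TTGACCATGGCATTAGCAGGTA", "+", "I" * 22,
    "@read4", "ACGTACGTACGTACGTACGTAC", "+", "I" * 22,
]

top = top_sequences(demo_fastq, n=1)
fasta_text = to_fasta(top)
print(fasta_text)
```

For a real file, the same function can be fed an open handle, e.g. `gzip.open(path, "rt")` for a gzipped FASTQ. If the top sequence accounts for roughly half of the reads in every sample, that is the one to search against the NCBI nucleotide database.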


Actually, we did BLAST against the NCBI nucleotide database, but the results were pretty inconclusive: a lot of hits with very high e-values. And yes, without the adapters trimmed, FastQC identified the primer sequence, but once the adapters were trimmed it did not.

Don't forget to Google this sequence. I had an experience where MegaBLAST failed to identify a tRNA sequence with a post-processed end whereas Google found it mentioned in a paper.

Never thought of that. Thank you.

