Question: Duplicate reads versus unmapped reads
gravatar for pkallurkar
4 weeks ago by
pkallurkar0 wrote:


I have 17 samples collected through patch-seq, processed using SMART-Seq HT Plus Kit, and sequenced on the HiSeq X platform. I used STAR v2.7.7a to align my reads to the mouse genome mm10 (my reference genome). I'm relatively new to this field and would like to understand the difference between unmapped reads (flagged by STAR) and duplicate reads (reported by FastQC). We have a high value of duplicate reads which I read is "normal" for single-cell RNA seq (assuming that it applies to patch-seq too) because of overamplification which is inevitable for a small amount of starting material. In addition to this, we have some samples with high unmapped reads too. Are the unmapped reads the same as duplicate reads? Or do they include duplicate reads? I have attached a link for the graph of %unmapped reads and %duplicate reads for all my samples.

Thanks in advance. enter image description here
enter image description here

fastqc rna-seq star patch-seq • 171 views
ADD COMMENTlink modified 4 weeks ago by Pierre Lindenbaum134k • written 4 weeks ago by pkallurkar0

Are the unmapped reads the same as duplicate reads?

Some undoubtedly are but they are not all likely to be duplicates.

SMRT-seq is the full length RNAseq kit correct? Have you taken some of the unmapped reads and blasted them at NCBI to see what they are. Specifically to rule out problem with contamination of some sort.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by GenoMax96k

I am very new to this. I tried blasting only the unmapped reads. I keep getting Server error on NCBI. So I tried the command line version of BLAST which generated a ~274GB text file. I don't know how to visualize this file or even open this file as my Mac keeps crashing whenever I try. Any ideas?

ADD REPLYlink written 4 weeks ago by pkallurkar0

When doing diagnostic testing you should take a few reads (< 10) and then use the web blast interface to do the search. Convert the fastq sequence to fasta format when you paste them into blast web search.

ADD REPLYlink written 4 weeks ago by GenoMax96k

I tried for the top 10 sequences of the unmapped reads for my first Sample. The percent identity hit for Mus Musculus is >99.13% e-value <3e-50. The rest are with rats and other members of the rodent family plus primates. This is very confusing for me now, if the e-value is so low why is STAR assigning it as unmapped?

ADD REPLYlink written 4 weeks ago by pkallurkar0

Are you sure you are selecting unmapped reads? Did you use --outReadsUnmapped Fastx option to get reads that are unmapped?

Another possibility. If these reads are multi-mapping (I think STAR allows max of 10 locations by default) STAR may be marking them as unmapped.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by GenoMax96k

Yes, I did use --outReadsUnmapped Fastx to get the unmapped reads. The amount of multi-mapped reads is relatively low (4.62 ± 1.46 %).

ADD REPLYlink written 4 weeks ago by pkallurkar0

This is puzzling. You have very high % of unmapped reads but the reads when spot checked at NCBI are showing up as aligning to correct genome. Did you run STAR with more or less default options? If you did not then could you do that and see if the % unmapped number goes down?

ADD REPLYlink written 4 weeks ago by GenoMax96k

I mostly used the default options: For indexing, I did set the --sjdbOverhang 149 as I have 2*150bp reads. And set the genomeSAsparseD to 2 to save computing power. The code line used: STAR --runThreadN 4 --runMode genomeGenerate --genomeDir {path to genome folder} --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly.fa --sjdbGTFfile Mus_musculus.GRCm38.102.gtf --sjdbOverhang 149 genomeSAsparseD 2

For alignment, I used default settings: STAR --runThreadN 4 --runMode alignReads --readFilesCommand gunzip -c --genomeDir {path_to_Dir} --outFileNamePrefix Sample1_R1 --readFilesIn {read_1}, {read_2} --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx

Does the above code look alright or what parameters should I play around with to improve my results?

ADD REPLYlink written 4 weeks ago by pkallurkar0

I am not a regular STAR user but those are more or less default options for alignments. I think that

default: 10
int: maximum number of loci the read is allowed to map to. Alignments (all of them) will be output only if the read maps to no more loci than this value.

is causing these reads to end up in "unmapped files". Even if you are not explicitly changing this parameter it is getting applied when a read maps to more than 10 locations.

Does your data have rRNA contamination? Is it from FFPE type samples where you have degraded short RNA's?

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by GenoMax96k

No, we don't have FFPE type samples. We extracted samples from fresh tissue. I'm mapping the samples to rRNA FASTA from NCBI to check for rRNA contamination. I just did Qualimap on the sample and see a relatively high percentage of introns in some of the samples which also points to rRNA contamination.

ADD REPLYlink written 4 weeks ago by pkallurkar0

Let us know how that goes. rRNA contamination seems like the best bet for now.

ADD REPLYlink written 4 weeks ago by GenoMax96k

I'm still running mapping for rRNA fasta. But I noticed something in one of the unmapped reads that there is a sequence, primarily comprising of A's (with N) occurring multiple times. I have attached the image with these sequences in the red box. I think it is crap and I don't get any hits on BLAST. Any idea what does this mean and why this is appearing:





ADD REPLYlink modified 29 days ago • written 29 days ago by pkallurkar0

Those reads are not useful. Perhaps they are a side effect of SMRTseq. So your STAR result may be ok after all. STAR was able to align whatever reads it could and the rest are .. you know.

ADD REPLYlink modified 29 days ago • written 29 days ago by GenoMax96k

That's a bummer. Thanks for the reply. Is the data still usable though?

ADD REPLYlink modified 29 days ago • written 29 days ago by pkallurkar0

Some samples have very high (70% unmapped) reads so not sure how that is going to affect your downstream analysis. You may have to try that out and see what happens. Your power to detect changes may be limited by amount of data that aligned.

What genome is the data from? How many million reads have aligned for the sample where 70% reads are unmapped.

You could use and scan/remove these polyA/polyT containing reads. My hunch is that remaining reads would likely equate to what STAR was able to align.

ADD REPLYlink written 29 days ago by GenoMax96k

The genome is the mouse genome (mm10). The sample with ~70% unmapped reads has ~21.5M uniquely aligned reads (number of raw reads >70M), so definitely, it is not the best sample.

I will try to scan and remove the polyA/polyT containing reads as you suggested. Hopefully, it helps in making a call on which sample is usable and which is not.

ADD REPLYlink written 29 days ago by pkallurkar0

If you have ~22 M mapped reads in that worst sample then all data should be OK. That is good news. I would say instead of spending time on the unmapped reads go forward with your analysis.

ADD REPLYlink modified 29 days ago • written 29 days ago by GenoMax96k

You just made my day. Thanks a ton.

ADD REPLYlink written 29 days ago by pkallurkar0
gravatar for Pierre Lindenbaum
4 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum134k wrote:

. Are the unmapped reads the same as duplicate reads? Or do they include duplicate reads?

unmapped reads have not been mapped on the REFerence genome

duplicate reads:

Duplicates can arise during sample preparation e.g. library construction using PCR. See also EstimateLibraryComplexity for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.

PCR-associated duplication artifacts can result from: inadequate amounts of starting material (genomic DNA, cDNA, etc.), losses during cleanups, and size selection issues. Duplicate reads can also arise from optical duplicates resulting from sequencing-machine optical sensor artifacts.

ADD COMMENTlink written 4 weeks ago by Pierre Lindenbaum134k

So, can some unmapped reads also be duplicate reads?

ADD REPLYlink written 4 weeks ago by pkallurkar0


ADD REPLYlink written 4 weeks ago by GenoMax96k

Thank you, really appreciate the prompt replies.

ADD REPLYlink written 4 weeks ago by pkallurkar0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1910 users visited in the last hour