Question

Duplicate reads versus unmapped reads

0

Entering edit mode

4.4 years ago

pkallurkar • 0

Hello,

I have 17 samples collected through patch-seq, processed using SMART-Seq HT Plus Kit, and sequenced on the HiSeq X platform. I used STAR v2.7.7a to align my reads to the mouse genome mm10 (my reference genome). I'm relatively new to this field and would like to understand the difference between unmapped reads (flagged by STAR) and duplicate reads (reported by FastQC). We have a high value of duplicate reads which I read is "normal" for single-cell RNA seq (assuming that it applies to patch-seq too) because of overamplification which is inevitable for a small amount of starting material. In addition to this, we have some samples with high unmapped reads too. Are the unmapped reads the same as duplicate reads? Or do they include duplicate reads? I have attached a link for the graph of %unmapped reads and %duplicate reads for all my samples.

Thanks in advance. enter image description here

patch-seq RNA-Seq STAR FastQC • 3.6k views

ADD COMMENT • link updated 4.4 years ago by Pierre Lindenbaum 166k • written 4.4 years ago by pkallurkar • 0

0

Entering edit mode

Are the unmapped reads the same as duplicate reads?

Some undoubtedly are but they are not all likely to be duplicates.

SMRT-seq is the full length RNAseq kit correct? Have you taken some of the unmapped reads and blasted them at NCBI to see what they are. Specifically to rule out problem with contamination of some sort.

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

I am very new to this. I tried blasting only the unmapped reads. I keep getting Server error on NCBI. So I tried the command line version of BLAST which generated a ~274GB text file. I don't know how to visualize this file or even open this file as my Mac keeps crashing whenever I try. Any ideas?

ADD REPLY • link 4.4 years ago by pkallurkar • 0

0

Entering edit mode

When doing diagnostic testing you should take a few reads (< 10) and then use the web blast interface to do the search. Convert the fastq sequence to fasta format when you paste them into blast web search.

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

I tried for the top 10 sequences of the unmapped reads for my first Sample. The percent identity hit for Mus Musculus is >99.13% e-value <3e-50. The rest are with rats and other members of the rodent family plus primates. This is very confusing for me now, if the e-value is so low why is STAR assigning it as unmapped?

ADD REPLY • link 4.4 years ago by pkallurkar • 0

0

Entering edit mode

Are you sure you are selecting unmapped reads? Did you use --outReadsUnmapped Fastx option to get reads that are unmapped?

Another possibility. If these reads are multi-mapping (I think STAR allows max of 10 locations by default) STAR may be marking them as unmapped.

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

Yes, I did use --outReadsUnmapped Fastx to get the unmapped reads. The amount of multi-mapped reads is relatively low (4.62 ± 1.46 %).

ADD REPLY • link 4.4 years ago by pkallurkar • 0

0

Entering edit mode

This is puzzling. You have very high % of unmapped reads but the reads when spot checked at NCBI are showing up as aligning to correct genome. Did you run STAR with more or less default options? If you did not then could you do that and see if the % unmapped number goes down?

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

I mostly used the default options: For indexing, I did set the --sjdbOverhang 149 as I have 2*150bp reads. And set the genomeSAsparseD to 2 to save computing power. The code line used: STAR --runThreadN 4 --runMode genomeGenerate --genomeDir {path to genome folder} --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly.fa --sjdbGTFfile Mus_musculus.GRCm38.102.gtf --sjdbOverhang 149 genomeSAsparseD 2

For alignment, I used default settings: STAR --runThreadN 4 --runMode alignReads --readFilesCommand gunzip -c --genomeDir {path_to_Dir} --outFileNamePrefix Sample1_R1 --readFilesIn {read_1}, {read_2} --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx

Does the above code look alright or what parameters should I play around with to improve my results?

ADD REPLY • link 4.4 years ago by pkallurkar • 0

0

Entering edit mode

I am not a regular STAR user but those are more or less default options for alignments. I think that

--outFilterMultimapNmax
default: 10
int: maximum number of loci the read is allowed to map to. Alignments (all of them) will be output only if the read maps to no more loci than this value.

is causing these reads to end up in "unmapped files". Even if you are not explicitly changing this parameter it is getting applied when a read maps to more than 10 locations.

Does your data have rRNA contamination? Is it from FFPE type samples where you have degraded short RNA's?

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

No, we don't have FFPE type samples. We extracted samples from fresh tissue. I'm mapping the samples to rRNA FASTA from NCBI to check for rRNA contamination. I just did Qualimap on the sample and see a relatively high percentage of introns in some of the samples which also points to rRNA contamination.

ADD REPLY • link 4.4 years ago by pkallurkar • 0

1

Entering edit mode

Let us know how that goes. rRNA contamination seems like the best bet for now.

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

I'm still running mapping for rRNA fasta. But I noticed something in one of the unmapped reads that there is a sequence, primarily comprising of A's (with N) occurring multiple times. I have attached the image with these sequences in the red box. I think it is crap and I don't get any hits on BLAST. Any idea what does this mean and why this is appearing:

**>MG01HX01:1039:HCTMGCCX2:8:1101:1255:54717/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAACAAACAAAAAAAATTAAAAGAAA

MG01HX01:1039:HCTMGCCX2:8:1101:1255:54717/2 NTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTGTTTTTTTTTGTTTTTT** MG01HX01:1039:HCTMGCCX2:8:1101:1255:57213/1 TGTNGTTCTGTGTAGTTCATATAAAATGGAAAGGGCAAAGACAAGATTCAGATTAAATATTAAATATTATTCTGGAAGCTGGCTAGACTCTCAGAGGGTCAGGAACACAGAATTAGTTTGGAACTTTTGAGAGGAGATCGGAAGAGCACAC MG01HX01:1039:HCTMGCCX2:8:1101:1255:57213/2 NCTCTCAAAAGTTCCAAACTAATTCTGTGTTCCTGACCCTCTGAGAGTCTAGCCAGCTTCCAGAATAATATTTAATATTTAATCTGAATCTTGTCTTTGCCCTTTCCATTTTATATGAACTACACAGAACTACAAGATCGGAAGAGCGTCG >MG01HX01:1039:HCTMGCCX2:8:1101:1255:61784/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA MG01HX01:1039:HCTMGCCX2:8:1101:1255:61784/2 NTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

**>MG01HX01:1039:HCTMGCCX2:8:1101:1255:71735/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

MG01HX01:1039:HCTMGCCX2:8:1101:1255:71735/2 NTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTGTTTTGTTTTTTTTTTTT** MG01HX01:1039:HCTMGCCX2:8:1101:1255:72332/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAACAACACCAAAACAAAGAATAAAAAAAACAAAAACTAAAAAGACAAACAGAGATAC MG01HX01:1039:HCTMGCCX2:8:1101:1255:72332/2 NTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTGTTTTTGTGTTTTCTCTTTTCTTTTC MG01HX01:1039:HCTMGCCX2:8:1101:1255:73387/1 CCANTGCTTAAAAAAAAAAAAAAAAAAAAAAAAAACTCTGCGATGAAACCAAAAATTAAAAAAAAAAAAAAAAAAAAAAAAAACTTCGATGATAAAACTCCTATAAAAAAAAAAAAAAAGCTTAAAAAAAAAAAAAAAAAAAAAAAAAAAA MG01HX01:1039:HCTMGCCX2:8:1101:1255:73387/2 NTTTTTTTTTTTTTTTATCAGTTGTATCAACGCTTTGTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTCTTTTTTTTGCACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTCTTTGTGTGTT >MG01HX01:1039:HCTMGCCX2:8:1101:1265:31599/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAA MG01HX01:1039:HCTMGCCX2:8:1101:1265:31599/2 NTCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

ADD REPLY • link 4.4 years ago by pkallurkar • 0

1

Entering edit mode

Those reads are not useful. Perhaps they are a side effect of SMRTseq. So your STAR result may be ok after all. STAR was able to align whatever reads it could and the rest are .. you know.

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

That's a bummer. Thanks for the reply. Is the data still usable though?

ADD REPLY • link 4.4 years ago by pkallurkar • 0

0

Entering edit mode

Some samples have very high (70% unmapped) reads so not sure how that is going to affect your downstream analysis. You may have to try that out and see what happens. Your power to detect changes may be limited by amount of data that aligned.

What genome is the data from? How many million reads have aligned for the sample where 70% reads are unmapped.

You could use bbduk.sh and scan/remove these polyA/polyT containing reads. My hunch is that remaining reads would likely equate to what STAR was able to align.

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

The genome is the mouse genome (mm10). The sample with ~70% unmapped reads has ~21.5M uniquely aligned reads (number of raw reads >70M), so definitely, it is not the best sample.

I will try to scan and remove the polyA/polyT containing reads as you suggested. Hopefully, it helps in making a call on which sample is usable and which is not.

ADD REPLY • link 4.4 years ago by pkallurkar • 0

1

Entering edit mode

If you have ~22 M mapped reads in that worst sample then all data should be OK. That is good news. I would say instead of spending time on the unmapped reads go forward with your analysis.

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

You just made my day. Thanks a ton.

ADD REPLY • link 4.4 years ago by pkallurkar • 0

score 1 · Answer 1 · 2021-01-25

1

Entering edit mode

4.4 years ago

Pierre Lindenbaum 166k

. Are the unmapped reads the same as duplicate reads? Or do they include duplicate reads?

unmapped reads have not been mapped on the REFerence genome

duplicate reads:

Duplicates can arise during sample preparation e.g. library construction using PCR. See also EstimateLibraryComplexity for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.

PCR-associated duplication artifacts can result from: inadequate amounts of starting material (genomic DNA, cDNA, etc.), losses during cleanups, and size selection issues. Duplicate reads can also arise from optical duplicates resulting from sequencing-machine optical sensor artifacts.

ADD COMMENT • link 4.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

So, can some unmapped reads also be duplicate reads?

ADD REPLY • link 4.4 years ago by pkallurkar • 0

0

Entering edit mode

Yes.

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

Thank you, really appreciate the prompt replies.

ADD REPLY • link 4.4 years ago by pkallurkar • 0