Question

Big loss of reads after MarkDuplicates

3.0 years ago

reza ▴ 300

I checked my reads with FastQC and the duplication levels looked fine, but during the MarkDuplicates step I lost 55% of my reads as duplicates! What happened? Can I continue the downstream analysis with the Picard output file after removing the duplicated reads? Is 55% normal?

MarkDuplicates Re-sequencing Picard • 1.8k views

ADD COMMENT • link updated 3.0 years ago by David Parry ▴ 130 • written 3.0 years ago by reza ▴ 300

What kind of experiment is this? Are you calling SNPs?

ADD REPLY • link 3.0 years ago by GenoMax 141k

Yes, the first step is SNP calling, but in the next steps I will use this data to identify signatures of selection and introgression.

ADD REPLY • link 3.0 years ago by reza ▴ 300

There is not much you can do, as long as you did the marking correctly. Too many PCR cycles may have been used if the input DNA concentration was low.

ADD REPLY • link 3.0 years ago by GenoMax 141k

What is the problem with using this data for the analyses I mentioned earlier (SNP calling, detection of signatures of selection, and introgression)?

ADD REPLY • link 3.0 years ago by reza ▴ 300

You will need to provide more information about your experiment to get a truly useful answer. It's important to know that MarkDuplicates works by simply identifying reads or read pairs with identical mapping coordinates, so if your experiment is amplicon-based or enriches for a small target, it will give a much higher estimate than the likely real number of duplicates (in these circumstances it may not be appropriate to mark duplicates). Also, single-end data will produce a higher number of estimated duplicates than paired-end data, because a single read offers fewer distinct mapping positions than a pair does.
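To illustrate the point about mapping coordinates, here is a toy sketch in Python. It is not Picard's implementation (Picard also considers things like the unclipped 5' position, orientation, and library); the read tuples and the `count_duplicates` helper are made up for illustration. It only shows why keying on one coordinate flags more reads than keying on both ends of a pair.

```python
from collections import defaultdict

# Toy reads: (name, chrom, start, strand, mate_start). All values are made up.
reads = [
    ("r1", "chr1", 100, "+", 350),
    ("r2", "chr1", 100, "+", 350),  # same start and same mate start as r1
    ("r3", "chr1", 100, "+", 420),  # same start as r1, different mate start
    ("r4", "chr1", 205, "-", 10),
]

def count_duplicates(reads, paired):
    """Count reads that share a coordinate key with another read."""
    groups = defaultdict(list)
    for name, chrom, start, strand, mate_start in reads:
        key = (chrom, start, strand)
        if paired:
            key += (mate_start,)  # pairs must match at both ends
        groups[key].append(name)
    # Within each group one read is kept; the rest count as duplicates.
    return sum(len(names) - 1 for names in groups.values())

single_end_dups = count_duplicates(reads, paired=False)
paired_end_dups = count_duplicates(reads, paired=True)
print(single_end_dups, paired_end_dups)
```

With these toy reads, the single-end key groups r1, r2 and r3 together, while the paired-end key separates r3 because its mate maps elsewhere.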

ADD REPLY • link 3.0 years ago by David Parry ▴ 130

The whole-genome sequencing data are paired-end (150 bp), sequenced on an Illumina HiSeq 2500, and I want to do SNP calling and detect signatures of selection and introgression. What information do I need to give to get the right answer?

ADD REPLY • link 3.0 years ago by reza ▴ 300

55% is a very high number of duplicates for WGS. There's no hard-and-fast rule, but I would generally expect closer to 10% for PCR-based WGS library prep, so it suggests to me that either something went wrong during the library prep or something is going wrong when you are marking duplicates.

Is 55% the figure given by the metrics file that Picard produces, or did you calculate it some other way?
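If it helps to check the number, here is a rough Python sketch for pulling the duplication figure out of the metrics file. It assumes the usual layout: a `## METRICS CLASS` line, then a tab-separated header row, one row per library, and a blank line before any histogram. The `read_duplication_metrics` name and the example text are made up, and the columns are trimmed to a few for brevity. Note that `PERCENT_DUPLICATION` is reported as a fraction (0.55 means 55%).

```python
import io

def read_duplication_metrics(handle):
    """Return one dict per library row from a MarkDuplicates metrics file."""
    rows, header, in_metrics = [], None, False
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("## METRICS CLASS"):
            in_metrics = True  # the table starts on the next line
            continue
        if not in_metrics:
            continue
        if not line.strip():
            break  # a blank line ends the metrics table
        fields = line.split("\t")
        if header is None:
            header = fields
        else:
            rows.append(dict(zip(header, fields)))
    return rows

# Made-up example with a reduced set of columns.
example = (
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
    "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\t"
    "READ_PAIR_OPTICAL_DUPLICATES\tPERCENT_DUPLICATION\n"
    "lib1\t1000\t550\t20\t0.55\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Double\n"
)
rows = read_duplication_metrics(io.StringIO(example))
for row in rows:
    print(row["LIBRARY"], float(row["PERCENT_DUPLICATION"]))
```

Comparing the optical duplicate count with the total duplicate count in the same row can also hint at whether the duplicates are mostly PCR or optical.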

ADD REPLY • link 3.0 years ago by David Parry ▴ 130

The 55% figure is from the metrics file output by Picard.

ADD REPLY • link 3.0 years ago by reza ▴ 300

Now, after removing the duplicated reads, what is the problem with using this data for the analyses I mentioned earlier (SNP calling, detection of signatures of selection, and introgression)? Please help me make the right decision about my data. Must I discard my data?

ADD REPLY • link 3.0 years ago by reza ▴ 300

You can still call SNPs. Your variant caller should ignore any reads marked as duplicates so they won't interfere with your variant calling, but you should probably assess your depth of coverage after marking duplicates so you can infer your sensitivity to detect variants.
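As a back-of-the-envelope sketch of that depth check, the Python below scales a raw mean depth by the non-duplicate fraction and then uses a Poisson model to estimate what share of sites reach a minimum depth. The 30x raw depth and the minimum depth of 8 are hypothetical numbers chosen only for illustration, and real coverage is usually more uneven than Poisson, so the depth measured from the duplicate-marked BAM is what counts.

```python
import math

def effective_depth(raw_depth, duplicate_fraction):
    """Mean depth left once duplicate reads are ignored."""
    return raw_depth * (1.0 - duplicate_fraction)

def fraction_sites_with_min_depth(mean_depth, min_depth):
    """P(X >= min_depth) for X ~ Poisson(mean_depth)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth**k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below

raw = 30.0  # hypothetical raw mean depth, not taken from this thread
depth = effective_depth(raw, 0.55)  # 55% duplicates, as reported above
print(round(depth, 2))
print(fraction_sites_with_min_depth(depth, 8))  # 8 is an arbitrary threshold
```

If the effective depth turns out to be low, heterozygous sites are the ones most likely to be missed.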

ADD REPLY • link 3.0 years ago by David Parry ▴ 130