Question: Removing duplicates in high coverage ancient DNA mitochondrial data
4.0 years ago by
stolarek.ir650 wrote:

Hi all,

I have data from sequencing ancient DNA (aDNA). The genome in question is mitochondrial. Libraries were sometimes PE, sometimes SE. With duplicates we get very high coverage, ~1000-2000x. After removing duplicates we still maintain ~150x on average. However, the data loss is substantial, and it calls into question the amount of sequencing that was done and then discarded by Picard duplicate removal. One thing is certain: the mitochondrial genome is very short (~16,700 bp), so given the amount of sequencing it is only natural that many reads are flagged as duplicates. The other thing is that with aDNA there is no guarantee of the target's concentration or its completeness (some fragments may simply not be present). So this represents the most extreme case for Heng Li's simple equation:

dups = 0.5 * m / N

where m is the number of sequencing reads and N is the number of DNA molecules before amplification.
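As a rough illustration of the equation above, here is a minimal Python sketch. It assumes reads are drawn uniformly at random from N distinct molecules, in which case the expected duplicate fraction is 1 - (N/m) * (1 - exp(-m/N)), and 0.5 * m / N is its approximation for small m/N. The function names are made up for illustration, and real aDNA libraries are far from uniform, so the numbers are indicative only.

```python
import math

def dup_fraction_linear(m, n):
    """Approximation dups = 0.5 * m / N (only reasonable when m << N)."""
    return 0.5 * m / n

def dup_fraction_uniform(m, n):
    """Expected duplicate fraction, ASSUMING uniform sampling of N molecules."""
    x = m / n
    # Expected unique molecules seen is N * (1 - exp(-m/N)),
    # so the duplicate fraction is 1 - (1 - exp(-x)) / x.
    return 1.0 + math.expm1(-x) / x

def estimate_molecules(m, unique_reads, tol=1e-9):
    """Solve unique_reads = N * (1 - exp(-m/N)) for N by bisection."""
    if not 0 < unique_reads < m:
        raise ValueError("need 0 < unique_reads < m")

    def expected_unique(n):
        return -n * math.expm1(-m / n)

    # expected_unique(n) increases with n and is below unique_reads at n = unique_reads.
    lo, hi = float(unique_reads), 2.0 * unique_reads
    while expected_unique(hi) < unique_reads:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if expected_unique(mid) < unique_reads:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # Low-duplication regime: both formulas nearly agree.
    print(dup_fraction_linear(1_000, 1_000_000), dup_fraction_uniform(1_000, 1_000_000))
    # Regime like the one described here, using coverage (1500x -> 150x)
    # as a stand-in for read counts (assumes similar read lengths).
    n = estimate_molecules(1500, 150)
    print(n, dup_fraction_linear(1500, n), dup_fraction_uniform(1500, n))
```

With a roughly tenfold drop in coverage after deduplication, the linear formula returns a value well above 1, which just means m/N is outside the range where the approximation holds. In that regime almost every surviving molecule has already been sequenced.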

Finally, the mitochondrial sequences are to be assembled into a per-sample consensus, and SNPs are to be determined. Is it wise to remove duplicates from these data sets? Neither the consensus nor the SNP calls change with or without duplicate removal.

Also worth mentioning: even after mitochondrial enrichment, the target is in the minority, with most of the sequences stemming from bacterial contamination.

So: remove or not remove? Or is there another way to estimate the amount of duplication (just theoretical equations)? To support my decision I also ran a preseq library complexity estimation, to look for correlations in case I don't find a definite answer to this PCR duplicates problem.

Kind regards

duplicates picard adna
modified 4.0 years ago by Brice Sarver3.5k • written 4.0 years ago by stolarek.ir650

First, it all depends on the question you are trying to answer. For SNPs, removing (or marking) duplicates will help in finding true-positive SNPs. For assembly you do not need to remove duplicates (except PCR and optical duplicates).

written 4.0 years ago by Medhat8.7k

It's actually both: first obtaining a consensus assembly, and then determining the variants. Mt is haploid, thankfully. Either way, the takeaway message I get from your answer is that I should remove PCR and optical duplicates (I'm using Picard for this) even if my coverage is sky high. Many thanks for the prompt reply.

written 4.0 years ago by stolarek.ir650

4.0 years ago by
Brice Sarver3.5k (United States) wrote:

If your coverage is extremely high, additional information is not providing you more support for your conclusion - it's just duplicated data. You can identify duplicates and remove them down to an appropriate coverage (say, 30X or greater).

For de novo assembly, I recommend downsampling to somewhere around 60X or less. This is because many assemblers expect a certain coverage and will begin to split contigs that have excessive coverage. I have done a lot of mitochondrial assemblies and I noticed this all the time. For example, if my average coverage was 200X and some regions dipped below 100X (still very high), the contigs would be split at that point. Some assemblers will allow you to specify coverage a priori and attempt to alleviate this.
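To make the downsampling step concrete, here is a minimal Python sketch of how the keep-fraction could be derived from the current and target coverage. The function names and the in-memory read list are assumptions for illustration only. In practice the fraction would more likely be passed to a tool's subsampling option (for example `samtools view -s`), and read pairs would need to be kept or dropped together.

```python
import random

def mean_coverage(total_bases, genome_length):
    """Average depth: total sequenced bases divided by genome length."""
    return total_bases / genome_length

def downsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep to reach roughly target_coverage."""
    if current_coverage <= 0:
        raise ValueError("current_coverage must be positive")
    # Never ask for more than all of the reads.
    return min(1.0, target_coverage / current_coverage)

def subsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction` (single-end only)."""
    rng = random.Random(seed)
    return [r for r in reads if rng.random() < fraction]

if __name__ == "__main__":
    # Hypothetical numbers: 200X on a ~16.7 kb genome, aiming for 60X.
    genome_length = 16_700
    current = mean_coverage(200 * genome_length, genome_length)
    frac = downsample_fraction(current, 60)
    print(current, frac)
```

Note that random subsampling lowers the mean depth but does not even out the coverage bumps, so regions that were relatively low stay relatively low.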

written 4.0 years ago by Brice Sarver3.5k

I think 50x should be optimal, according to this paper:

Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data

modified 4.0 years ago • written 4.0 years ago by Medhat8.7k

Yeah, in aDNA the coverage bumps are huge (due to variation in sequence survival), so I used a simple sequence consensus. I think that's justified by the fact that mt is haploid and doesn't undergo recombination.

Grateful for the support of my reasoning that additional ultra-high coverage doesn't bring anything new (95% of the sequence is covered at least at 50X after duplicate removal, so that settles it). Thanks for the exact numbers!

written 4.0 years ago by stolarek.ir650

Hi, is sequence coverage calculated as (total read length / reference genome length) before or after removing duplicated and contained reads?

I am doing repeat identification based on read frequency: a read is flagged as a repeat if its frequency is higher than double the coverage. When I calculated the coverage before removing duplicates, it was too high, so no read was flagged as a repeat, even though I know the genome has repetitive regions longer than the read length.
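The effect described above can be reproduced with a toy Python sketch. It assumes that "duplicate" means an exact copy of a read sequence and that a read's "frequency" is the number of identical copies. Both are simplifications chosen for illustration (contained reads are not handled), and the function names are hypothetical.

```python
from collections import Counter

def coverage(reads, genome_length):
    """Coverage as total read length divided by reference genome length."""
    return sum(len(r) for r in reads) / genome_length

def flag_repeats(reads, genome_length, factor=2.0):
    """Flag reads whose copy count exceeds factor * coverage.

    Returns two sets: flagged using coverage computed before collapsing
    identical reads, and flagged using coverage computed after.
    """
    counts = Counter(reads)
    cov_before = coverage(reads, genome_length)
    cov_after = coverage(list(counts), genome_length)  # one copy per sequence
    flagged_before = {r for r, c in counts.items() if c > factor * cov_before}
    flagged_after = {r for r, c in counts.items() if c > factor * cov_after}
    return flagged_before, flagged_after

if __name__ == "__main__":
    # Toy data: one read sequence seen six times, two seen once.
    reads = ["ACGT"] * 6 + ["TTGA", "GGCC"]
    before, after = flag_repeats(reads, genome_length=10)
    print(before, after)
```

In this toy case the inflated pre-deduplication coverage pushes the threshold above every read count, so nothing is flagged, while the post-deduplication coverage gives a threshold that the over-represented read exceeds.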

written 19 months ago by sherifmagdy20070