Removing duplicates in high coverage ancient DNA mitochondrial data
7.6 years ago
stolarek.ir ▴ 700

Hi all,

I have data from sequencing ancient DNA; the genome in question is mitochondrial. Libraries were sometimes PE, sometimes SE. With duplicates included we get very high coverage, roughly 1000-2000x; after removing duplicates we can still maintain ~150x on average. However, the data loss is substantial and calls into question the amount of sequencing used, and then wasted by Picard duplicate removal, which discards a substantial amount of data. One thing is certain: the mitochondrial genome is very short (~16,700 bp), so given the amount of sequencing it is only natural that many reads are flagged as duplicates. The other issue is that with aDNA there is no guarantee of target concentration or completeness (some fragments may simply not be present). So this situation represents the most extreme case for Heng Li's simple equation:

dups = 0.5 * m / N

where m is the number of sequenced reads and N is the number of DNA molecules before amplification (http://seqanswers.com/forums/showthread.php?t=6854).
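For context, a minimal sketch of that back-of-the-envelope estimate (the numbers are made up for illustration, and the approximation is only meaningful while m is small relative to N, which is exactly why deep aDNA mitochondrial sequencing is the extreme case):

```python
# Back-of-the-envelope duplicate estimate from the formula above:
# dups = 0.5 * m / N, with m = sequenced reads and
# N = DNA molecules before amplification.
# Numbers are illustrative; the approximation only holds while m << N.

def expected_dup_fraction(m_reads: int, n_molecules: int) -> float:
    """Expected fraction of duplicate reads under the simple model."""
    return 0.5 * m_reads / n_molecules

# A complex library: few duplicates expected.
print(expected_dup_fraction(100_000, 1_000_000))   # 0.05, i.e. ~5% duplicates

# A low-complexity aDNA mitochondrial capture: the value exceeds 1,
# so the simple model is far outside its range and duplicates dominate.
print(expected_dup_fraction(2_000_000, 150_000))
```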

Finally, the mitochondrial sequences are to be assembled into a per-sample consensus, and SNPs are to be determined. Is it wise to remove duplicates from these data sets? Neither the consensus nor the SNP calls change whether duplicates are removed or not.

Also worth mentioning: even after mitochondrial enrichment, the target is in the minority, with most of the sequences stemming from bacterial contamination.

So: remove, or not remove? Or is there another way to estimate the duplicate fraction (just theoretical equations)? To support myself I also ran preseq library complexity estimation, looking for correlations in case I can't find a definite answer to this PCR duplicate problem (a rough invocation is sketched below).
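A minimal sketch of such a preseq run, wrapped in Python; the flags (-B for BAM input, -o for the output yield table) are assumptions to check against `preseq lc_extrap --help` for your installed version, and the file names are placeholders:

```python
# Rough sketch: estimate library complexity / future yield with preseq lc_extrap.
# Flags are assumptions (check `preseq lc_extrap --help`); requires a preseq
# build with BAM support and a coordinate-sorted BAM.
import subprocess

def run_preseq(bam: str, out_table: str = "future_yield.txt") -> None:
    """Run preseq lc_extrap on a sorted BAM and write the extrapolated yield table."""
    subprocess.run(
        ["preseq", "lc_extrap", "-B", "-o", out_table, bam],
        check=True,
    )

# Example (hypothetical file name):
# run_preseq("sample1.mito.sorted.bam")
```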

Kind regards

Tags: aDNA, duplicates, Picard

First, it all depends on the question you are trying to answer. For SNPs, removing (or marking) duplicates will help in finding true-positive SNPs. For assembly you do not need to remove duplicates, apart from PCR duplicates and optical duplicates.


It's actually both: first obtaining a consensus assembly, and then applying the variants. Thankfully the mitochondrion is haploid. Either way, the take-away message from your answer is that I should remove PCR and optical duplicates (I'm using Picard for this) even if my coverage is sky high. Many thanks for the prompt reply.
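For reference, a minimal sketch of the Picard MarkDuplicates step; depending on the installation it may be invoked as `picard` or `java -jar picard.jar`, and the file names are placeholders:

```python
# Minimal sketch: mark and remove duplicates with Picard MarkDuplicates.
# REMOVE_DUPLICATES=true drops duplicate reads instead of only flagging them.
import subprocess

def remove_duplicates(in_bam: str, out_bam: str, metrics: str) -> None:
    """Run Picard MarkDuplicates on a coordinate-sorted BAM."""
    subprocess.run(
        [
            "picard", "MarkDuplicates",
            f"I={in_bam}",
            f"O={out_bam}",
            f"M={metrics}",
            "REMOVE_DUPLICATES=true",
        ],
        check=True,
    )

# Example (hypothetical file names):
# remove_duplicates("sample1.mito.sorted.bam",
#                   "sample1.mito.dedup.bam",
#                   "sample1.dup_metrics.txt")
```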

7.6 years ago
Brice Sarver ★ 3.8k

If your coverage is extremely high, additional information is not providing more support for your conclusions; it's just duplicated data. You can identify duplicates and remove them, bringing coverage down to an appropriate level (say, 30X or greater).

For de novo assembly, I recommend downsampling to somewhere around 60X or less. This is because many assemblers expect a certain coverage and will begin to split contigs that have excessive coverage. I have done a lot of mitochondrial assemblies and I noticed this all the time. For example, if my average coverage was 200X and some regions dipped below 100X (still very high), the contigs would be split at that point. Some assemblers will allow you to specify coverage a priori and attempt to alleviate this.
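One way to downsample after duplicate removal is to derive a subsampling fraction from the current and target coverage and pass it to samtools view -s; a rough sketch, with the 60X target, seed, and file names as illustrative assumptions:

```python
# Rough sketch: downsample a deduplicated BAM to roughly a target coverage.
# samtools view -s takes SEED.FRACTION as a single float, e.g. 42.4 means
# seed 42 and keep ~40% of reads. Coverage values and file names are illustrative.
import subprocess

def downsample(in_bam: str, out_bam: str,
               current_cov: float, target_cov: float, seed: int = 42) -> None:
    """Subsample a BAM so its mean coverage is roughly target_cov."""
    frac = target_cov / current_cov          # fraction of reads to keep
    if frac >= 1.0:
        raise ValueError("coverage is already at or below the target")
    subprocess.run(
        ["samtools", "view", "-b", "-s", f"{seed + frac:.4f}", "-o", out_bam, in_bam],
        check=True,
    )

# Example: ~150X after deduplication, aiming for ~60X (hypothetical numbers).
# downsample("sample1.dedup.bam", "sample1.60x.bam", current_cov=150, target_cov=60)
```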


Yeah, in aDNA the coverage bumps are huge (due to variation in sequence survival), so I used a simple sequence consensus. I think that is justified by the fact that the mitochondrion is haploid and doesn't undergo recombination.

Grateful for supporting the logic I had developed, that additional ultra-high coverage doesn't bring anything new (95% of the sequence is covered at at least 50X after duplicate removal, so that settles it). Thanks for the exact numbers!


Hi, is sequence coverage calculated as (total read length / reference genome length) before or after removing duplicated and contained reads?

I am doing repeat identification based on read frequency, flagging a read as a repeat if its frequency is higher than double the coverage. When I calculate the coverage before removing duplicates it is too high, so no read gets flagged as a repeat, even though I know the genome has repetitive regions longer than the read length.
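As a rough illustration of that rule (an assumption-laden sketch: here "frequency" is simply the number of identical read sequences, and coverage is total read bases divided by genome length, computed on whatever read set you choose):

```python
# Rough sketch of the repeat-flagging rule described above.
# "Frequency" is taken as the count of identical read sequences (an assumption);
# coverage is total read bases divided by genome length.
from collections import Counter

def flag_repeats(reads: list[str], genome_length: int) -> set[str]:
    """Return read sequences whose frequency exceeds twice the mean coverage."""
    coverage = sum(len(r) for r in reads) / genome_length
    freq = Counter(reads)
    return {seq for seq, count in freq.items() if count > 2 * coverage}

# Toy example (illustrative only): the 50-copy read is flagged, the 2-copy read is not.
# repeats = flag_repeats(["ACGT"] * 50 + ["TTTT"] * 2, genome_length=100)
```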
