Question: Removal of PCR duplicates from trimmed reads
deepti1rao wrote:

I have tried samtools rmdup on my paired-end fastq files, which had been trimmed earlier. According to the samtools manual, rmdup works as follows: "Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality."

I have 23% duplicates in my data (found by aligning raw reads to the reference). Trimming the raw reads would have trimmed duplicates into reads of different lengths. How then would rmdup work on my pre-processed reads?

Is there a better option?

modified 3.3 years ago by genomax • written 3.3 years ago by deepti1rao

There are two things that you (I think) are mixing up.

1) trimming is done on the fastq files, but rmdup works on aligned data

2) duplicates are defined by the 5' ends of the paired-end data, while trimming takes away bases from the 3' end, so no worries, trimming will not affect duplicate removal.

modified 3.3 years ago • written 3.3 years ago by ATpoint

"external coordinates" means both 5' as well as 3', doesn't it?

written 3.3 years ago by deepti1rao

No. In paired-end data, the insert size, which defines the fragment, is determined solely by the 5' ends of the respective forward and reverse reads. Have a look at this figure: no matter where the 3' ends (the arrowheads) are, the 5' ends are unaffected, and so are the insert size and therefore the definition of a duplicate.
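
The point about 5' ends can be illustrated with a small sketch. This is a toy model, not how samtools computes duplicates internally: the `AlignedRead` class and the helper names are made up for illustration, and it assumes gapless alignments with no soft clipping and that trimming does not change where a read aligns.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AlignedRead:
    """Hypothetical, simplified alignment record (illustration only)."""
    chrom: str
    pos: int          # leftmost aligned reference position (1-based)
    length: int       # number of reference bases covered
    is_reverse: bool  # True if the read aligned to the reverse strand


def five_prime(read: AlignedRead) -> int:
    """Reference coordinate of the read's 5' end."""
    # For a reverse-strand read the 5' end is the rightmost coordinate.
    return read.pos + read.length - 1 if read.is_reverse else read.pos


def trim_three_prime(read: AlignedRead, n: int) -> AlignedRead:
    """Model removing n bases from the 3' end of the read."""
    if read.is_reverse:
        # The 3' end of a reverse-strand read is its leftmost coordinate.
        return replace(read, pos=read.pos + n, length=read.length - n)
    return replace(read, length=read.length - n)


def duplicate_key(fwd: AlignedRead, rev: AlignedRead) -> tuple:
    """Key built only from the 5' ends of both mates."""
    return (fwd.chrom, five_prime(fwd), rev.chrom, five_prime(rev))


fwd = AlignedRead("chr1", 1000, 100, False)
rev = AlignedRead("chr1", 1250, 100, True)

key_before = duplicate_key(fwd, rev)
key_after = duplicate_key(trim_three_prime(fwd, 20), trim_three_prime(rev, 35))
print(key_before == key_after)  # True
```

Under these assumptions, two copies of the same fragment keep the same key even if they were trimmed by different amounts at their 3' ends.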

modified 3.3 years ago • written 3.3 years ago by ATpoint

Thanks! Got it! Why do I still have 14% duplicates?

written 3.3 years ago by deepti1rao

Also, I had 23% duplicates earlier. After using rmdup, I'm still left with 14% duplicates.

written 3.3 years ago by deepti1rao

samtools rmdup does not remove duplicates when the two reads of a pair map to different chromosomes. Do the remaining 14% duplicates map to different chromosomes?

written 3.3 years ago by Tom_L

How can I find this out?

written 3.3 years ago by deepti1rao
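
One possible way to check this, sketched under assumptions: the function below reads SAM-formatted text lines and counts mapped, paired reads whose mate is on a different reference sequence, using the FLAG, RNAME and RNEXT columns as defined in the SAM specification. The function name and the example records are invented for illustration. To count only among the reads still flagged as duplicates, you would additionally test FLAG bit 0x400, which requires a tool that marks duplicates rather than removing them.

```python
def count_interchromosomal(sam_lines):
    """Return (inter, total): reads with mate on another chromosome, and all counted reads."""
    inter = 0
    total = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        rname = fields[2]
        rnext = fields[6]
        # Count only paired reads where both mates are mapped, and skip
        # secondary (0x100) and supplementary (0x800) alignments.
        if not flag & 0x1 or flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        total += 1
        # RNEXT is "=" when the mate is on the same reference as the read.
        if rnext not in ("=", "*", rname):
            inter += 1
    return inter, total


# Made-up records: pair r1 on one chromosome, pair r2 split across two.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r1\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "r2\t65\tchr1\t500\t60\t50M\tchr2\t700\t0\t*\t*",
    "r2\t129\tchr2\t700\t60\t50M\tchr1\t500\t0\t*\t*",
]
print(count_interchromosomal(example))  # (2, 4)
```

On real data the lines could come from the output of `samtools view -h` on your BAM file, for example by piping it into a script that passes `sys.stdin` to this function.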

Hello,

How did you trim your reads? During trimming, some reads may be discarded because they end up too short.

fin swimmer

written 3.3 years ago by finswimmer

I used bbduk from bbtools to trim my reads. Yes, I lost some reads in the process of doing so, but not many.

modified 3.3 years ago • written 3.3 years ago by deepti1rao

genomax wrote:

Clumpify will allow you to remove all kinds of duplicates (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."). You also do not need to align the data to remove duplicates.

What kind of an experiment is this?

modified 3.3 years ago • written 3.3 years ago by genomax

I'm aware that there are some tools that remove duplicates from reads without the need to align them. But I don't understand how these can tell the difference between a PCR duplicate and a repeat in the genome.

written 3.3 years ago by deepti1rao

"But I don't understand how these can tell the difference between a PCR duplicate and a repeat in the genome."

If reads are identical over their entire length, then there is no need to worry about whether they come from two different repeats in the genome. You can't distinguish them by any means.
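
As a rough illustration of alignment-free duplicate removal, here is a minimal sketch that keeps the first occurrence of each identical read pair. It is not Clumpify's algorithm; the function name and the input layout (name, read 1 sequence, read 2 sequence) are assumptions made for this example.

```python
def dedup_exact_pairs(pairs):
    """Keep the first occurrence of each (read1, read2) sequence pair."""
    seen = set()
    kept = []
    for name, seq1, seq2 in pairs:
        key = (seq1, seq2)  # a pair is a duplicate only if both mates match exactly
        if key in seen:
            continue
        seen.add(key)
        kept.append((name, seq1, seq2))
    return kept


# Made-up reads: p2 is an exact copy of p1, p3 differs in read 2.
pairs = [
    ("p1", "ACGTACGT", "TTGGCCAA"),
    ("p2", "ACGTACGT", "TTGGCCAA"),
    ("p3", "ACGTACGT", "TTGGCCAT"),
]
print([name for name, _, _ in dedup_exact_pairs(pairs)])  # ['p1', 'p3']
```

Note that this exact-match toy would treat copies of a fragment that were trimmed to different lengths as distinct, and it cannot tell a PCR duplicate from two identical reads that originate from different repeat copies.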

written 3.3 years ago by genomax

The situation you describe would lead to a low mapping quality during alignment, probably even 0. Good practice is to discard these reads categorically because, as you say, one cannot distinguish them properly. I typically remove duplicates after all the filtering steps. For standard assays like ATAC or ChIP, keep only reads that are properly paired, primary alignments, with MAPQ >= 30 and a reasonable insert size (usually up to 2000 bp), followed by duplicate removal.
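
The filtering criteria above could be expressed per read roughly as follows. This is a sketch, not a replacement for an established filtering tool: the function name and default thresholds are illustrative, the FLAG bits follow the SAM specification, and treating supplementary alignments (0x800) as non-primary is an assumption of this example.

```python
def passes_filter(flag, mapq, tlen, min_mapq=30, max_insert=2000):
    """Return True if a read meets the pre-deduplication criteria."""
    if not flag & 0x2:
        return False  # not flagged as properly paired
    if flag & (0x100 | 0x800):
        return False  # secondary or supplementary alignment
    if mapq < min_mapq:
        return False  # ambiguous placement, e.g. repeats
    # TLEN is signed in SAM, so compare its absolute value.
    return abs(tlen) <= max_insert


print(passes_filter(flag=99, mapq=60, tlen=250))   # True
print(passes_filter(flag=99, mapq=10, tlen=250))   # False
```

Duplicate removal would then run only on the reads for which this returns True.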

written 3.3 years ago by ATpoint

I'm worried about bias from duplicates during variant calling.

written 3.3 years ago by deepti1rao

vmicrobio wrote:

I use prinseq (-derep option) to remove duplicates.

written 3.3 years ago by vmicrobio

Thanks! Will try prinseq!

written 3.3 years ago by deepti1rao