Question: Removal of PCR duplicates from trimmed reads
deepti1rao20 wrote:

I have tried samtools rmdup on my paired-end FASTQ files, which had been trimmed earlier. According to the samtools manual, rmdup works as follows: "Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality."

I have 23% duplicates in my data (found by aligning raw reads to the reference). Trimming the raw reads would have trimmed duplicates into reads of different lengths. How then would rmdup work on my pre-processed reads?

Is there a better option?

modified 20 months ago by genomax65k • written 20 months ago by deepti1rao20

There are two things that you (I think) are mixing up.

1) trimming is done on the fastq files, but rmdup works on aligned data

2) duplicates are defined by the 5' ends of the paired-end data, while trimming takes away bases from the 3' end, so no worries, trimming will not affect duplicate removal.

modified 20 months ago • written 20 months ago by ATpoint15k

"external coordinates" means both 5' as well as 3', doesn't it?

written 20 months ago by deepti1rao20

No. In paired-end data, the insert size, which defines the fragment, is determined solely by the 5' ends of the respective fwd and rev reads. Have a look at this figure: no matter where the 3' ends (the arrowheads) are, the 5' ends are unaffected, and so are the insert size and therefore the definition of a duplicate.

modified 20 months ago • written 20 months ago by ATpoint15k
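
To make the point about 5' ends concrete, here is a minimal Python sketch. It is not how samtools is implemented; the `Pair` type, the `trim_3prime` helper and the coordinates are all made up for illustration. It assumes trimming only removes bases from the 3' end of each read, and that a duplicate is identified by the chromosome plus the two 5' ends.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    """Hypothetical aligned read pair in reference coordinates."""
    chrom: str
    fwd_start: int  # leftmost base of the forward read (its 5' end)
    fwd_end: int    # rightmost base of the forward read (its 3' end)
    rev_start: int  # leftmost base of the reverse read (its 3' end)
    rev_end: int    # rightmost base of the reverse read (its 5' end)


def trim_3prime(pair: Pair, n_fwd: int, n_rev: int) -> Pair:
    """Remove n_fwd / n_rev bases from the 3' end of each read."""
    return pair._replace(
        fwd_end=pair.fwd_end - n_fwd,      # forward read shrinks from the right
        rev_start=pair.rev_start + n_rev,  # reverse read shrinks from the left
    )


def duplicate_key(pair: Pair) -> tuple:
    """Assumed duplicate definition: chromosome plus both 5' ends."""
    return (pair.chrom, pair.fwd_start, pair.rev_end)


original = Pair("chr1", 100, 249, 351, 500)
trimmed = trim_3prime(original, n_fwd=12, n_rev=30)

# The 3' ends moved, but the key built from the 5' ends did not.
assert duplicate_key(original) == duplicate_key(trimmed)
```

The same reasoning shows where it would break: if a trimmer also removes bases from the 5' end of a read, `fwd_start` or `rev_end` changes and the key no longer matches.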

Thanks! Got it! Why do I still have 14% duplicates?

written 20 months ago by deepti1rao20

Also, I had 23% duplicates earlier. After using rmdup, I'm still left with 14% duplicates.

written 20 months ago by deepti1rao20

samtools rmdup does not remove duplicates when paired reads map to different chromosomes. Do the remaining 14% of duplicates map to different chromosomes?

written 20 months ago by Tom_L310
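
One possible way to check this is to count the alignment records whose mate is on a different chromosome. The sketch below is only an illustration: it parses plain SAM text and relies on the standard SAM columns (FLAG, RNAME, RNEXT) and flag bits. The function names and the demo records are made up.

```python
# Standard SAM flag bits used below.
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def mate_on_other_chromosome(sam_line: str) -> bool:
    """True if this record and its mate are mapped to different references."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname, rnext = fields[2], fields[6]
    if not flag & PAIRED:
        return False
    # Skip records where either mate is unmapped, and non-primary records.
    if flag & (UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    # RNEXT is "=" when the mate is on the same reference, "*" when unknown.
    return rnext not in ("=", "*", rname)


def count_interchromosomal(sam_lines):
    """Return (records with mate on another chromosome, total records)."""
    hits = total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip blanks and header
            continue
        total += 1
        hits += mate_on_other_chromosome(line)
    return hits, total


demo = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t97\tchr1\t100\t60\t4M\tchr2\t500\t0\tACGT\tIIII",
    "r1\t145\tchr2\t500\t60\t4M\tchr1\t100\t0\tACGT\tIIII",
    "r2\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",
]
result = count_interchromosomal(demo)
```

On real data, the same function could be fed the text output of `samtools view` for the deduplicated BAM, for example by passing `sys.stdin` as `sam_lines`. Each read of a pair is counted separately, so halve the first number to get pairs.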

How can I find this out?

written 20 months ago by deepti1rao20

Hello,

How did you trim your reads? When trimming, it can happen that not all reads survive because they are now too short.

fin swimmer

written 20 months ago by finswimmer11k

I used bbduk from bbtools to trim my reads. Yes, I lost some reads in the process of doing so, but not many.

modified 20 months ago • written 20 months ago by deepti1rao20
genomax65k wrote:

clumpify will allow you to remove all kinds of duplicates; see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates." You also do not need to align the data to remove duplicates.

What kind of an experiment is this?

modified 20 months ago • written 20 months ago by genomax65k

I'm aware that there are some tools that remove duplicates from reads without the need to align them. But I don't understand how these can tell the difference between a PCR duplicate and a repeat in the genome.

written 20 months ago by deepti1rao20

"But I don't understand how these can tell the difference between a PCR duplicate and a repeat in the genome."

If reads are identical over their entire length, then there is no need to worry about whether they come from two different repeats in the genome. You can't distinguish them by any means.

written 20 months ago by genomax65k
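
As a rough illustration of alignment-free duplicate removal, the sketch below keeps the first occurrence of each read pair whose two sequences are identical over their full length. This is only a toy exact-match filter, not a description of how clumpify or prinseq work internally; the function name and the example reads are made up.

```python
def dedupe_exact_pairs(pairs):
    """Keep the first pair for every distinct (read 1, read 2) sequence combination.

    `pairs` is an iterable of (name, seq1, seq2) tuples.
    """
    seen = set()
    kept = []
    for name, seq1, seq2 in pairs:
        key = (seq1, seq2)
        if key in seen:  # identical over the entire length of both reads
            continue
        seen.add(key)
        kept.append((name, seq1, seq2))
    return kept


reads = [
    ("a", "ACGTACGT", "TTGACCAA"),
    ("b", "ACGTACGT", "TTGACCAA"),  # exact copy of "a"
    ("c", "ACGTACGT", "TTGACCAT"),  # differs by one base in read 2
    ("d", "ACGTAC", "TTGACCAA"),    # copy of "a" trimmed to a different length
]
unique = dedupe_exact_pairs(reads)
kept_names = [name for name, _, _ in unique]
```

Note that pair "d" survives: an exact-match filter like this one treats copies that were trimmed to different lengths as distinct reads.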

The situation you describe would lead to a low mapping quality during the alignment, probably even 0. Good practice is to discard these reads categorically because, as you say, one cannot distinguish them properly. I typically remove duplicates after all the filtering steps. For standard assays like ATAC or ChIP, keep only reads that are properly paired, primary alignments, with MAPQ >= 30 and a reasonable insert size (usually up to 2000 bp), followed by duplicate removal.

written 20 months ago by ATpoint15k
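
The filtering criteria above can be sketched as a small predicate over SAM records. This is an illustration only: it uses the standard SAM FLAG, MAPQ and TLEN columns, takes TLEN as a proxy for insert size, and treats both secondary and supplementary records as non-primary (that last choice is an assumption). The function name, default thresholds as parameters, and demo records are made up.

```python
PROPER_PAIR = 0x2
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def passes_filters(sam_line: str, min_mapq: int = 30, max_insert: int = 2000) -> bool:
    """Apply the filters described above to one SAM alignment record."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq, tlen = int(fields[1]), int(fields[4]), int(fields[8])
    properly_paired = bool(flag & PROPER_PAIR)
    primary = not flag & (SECONDARY | SUPPLEMENTARY)
    # TLEN is signed (negative for the rightmost read), so compare its magnitude.
    insert_ok = 0 < abs(tlen) <= max_insert
    return properly_paired and primary and mapq >= min_mapq and insert_ok


def make_record(flag: int, mapq: int, tlen: int) -> str:
    """Build a minimal SAM line for the demo (all other fields are placeholders)."""
    return f"r\t{flag}\tchr1\t100\t{mapq}\t4M\t=\t300\t{tlen}\tACGT\tIIII"


good = make_record(99, 60, 250)
low_mapq = make_record(99, 10, 250)
long_insert = make_record(99, 60, 5000)
not_proper = make_record(97, 60, 250)
secondary = make_record(99 | SECONDARY, 60, 250)
```

Records that pass would then go on to duplicate removal, which keeps the order given above: filter first, deduplicate last.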

I'm worried about bias owing to duplicates during variant calling.

written 20 months ago by deepti1rao20
vmicrobio240 wrote:

I use prinseq (-derep option) to remove duplicates

written 20 months ago by vmicrobio240

Thanks! Will try prinseq!

written 20 months ago by deepti1rao20