Question

Removal of PCR duplicates from trimmed reads

0

6.7 years ago

deepti1rao ▴ 50

I have tried samtools rmdup on my paired-end FASTQ data, which had been trimmed beforehand. According to the samtools manual, rmdup works as follows: "Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality."

I have 23% duplicates in my data (found by aligning raw reads to the reference). Trimming the raw reads would have trimmed duplicates into reads of different lengths. How then would rmdup work on my pre-processed reads?

Is there a better option?

samtools rmdup duplicates pre-processed reads • 3.7k views

ADD COMMENT • link updated 6.7 years ago by GenoMax 141k • written 6.7 years ago by deepti1rao ▴ 50

1

I think you are mixing up two things:

1) Trimming is done on the FASTQ files, but rmdup works on aligned data.

2) Duplicates are defined by the 5' ends of the paired-end reads, while trimming removes bases from the 3' end, so no worries: trimming will not affect duplicate removal.

ADD REPLY • link 6.7 years ago by ATpoint 82k

0

"external coordinates" means both 5' as well as 3', doesn't it?

ADD REPLY • link 6.7 years ago by deepti1rao ▴ 50

1

No. In paired-end data, the insert size, which defines the fragment, is determined solely by the 5' ends of the respective fwd and rev reads. Have a look at this figure: no matter where the 3' ends (the arrow heads) are, the 5' ends are unaffected, and so are the insert size and, with it, the definition of a duplicate.

ADD REPLY • link 6.7 years ago by ATpoint 82k
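To make the 5' end argument above concrete, here is a minimal sketch. It is not how samtools is implemented; the `Read` type and the helper names are made up for illustration, and it assumes simple gapless alignments with no soft clipping. It builds a duplicate key from the 5' ends of both mates and shows that trimming bases off the 3' ends leaves that key unchanged.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified aligned read (gapless, no clipping assumed)."""
    chrom: str
    start: int      # leftmost aligned base on the reference, 0-based
    length: int     # number of reference bases covered
    reverse: bool   # True if the read aligns to the reverse strand


def five_prime(read: Read) -> int:
    """Reference position of the read's 5' end."""
    # Forward read: the 5' end is the leftmost base.
    # Reverse read: the 5' end is the rightmost base.
    return read.start + read.length - 1 if read.reverse else read.start


def fragment_key(r1: Read, r2: Read) -> tuple:
    """Key shared by all read pairs with the same 5' ends and strands."""
    return (r1.chrom, five_prime(r1), r1.reverse,
            r2.chrom, five_prime(r2), r2.reverse)


def trim_3prime(read: Read, n: int) -> Read:
    """Remove n bases from the 3' end of an aligned read."""
    if read.reverse:
        # The 3' end of a reverse read is its leftmost base,
        # so the start moves right while the 5' end stays put.
        return read._replace(start=read.start + n, length=read.length - n)
    return read._replace(length=read.length - n)


fwd = Read("chr1", 1000, 100, False)
rev = Read("chr1", 1250, 100, True)

untrimmed_key = fragment_key(fwd, rev)
trimmed_key = fragment_key(trim_3prime(fwd, 12), trim_3prime(rev, 30))
print(untrimmed_key == trimmed_key)
```

Note that this only holds for 3' trimming. If the trimming step also removes bases from the 5' end of a read, the key changes and two duplicates may no longer look identical.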

0

Thanks! Got it! Why do I still have 14% duplicates?

ADD REPLY • link 6.7 years ago by deepti1rao ▴ 50

0

Also, I had 23% duplicates earlier. After using rmdup, I'm still left with 14% duplicates.

ADD REPLY • link 6.7 years ago by deepti1rao ▴ 50

0

samtools rmdup does not remove duplicates when the two reads of a pair map to different chromosomes. Do the remaining 14% of duplicates map to different chromosomes?

ADD REPLY • link 6.7 years ago by Tom_L ▴ 350

0

How can I find this out?

ADD REPLY • link 6.7 years ago by deepti1rao ▴ 50
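One possible way to check this, sketched under assumptions: in SAM text, column 7 (RNEXT) holds the reference name of the mate and is "=" when both mates are on the same reference. The function names below are hypothetical, and the example records are made up. The sketch counts alignment records whose mate is mapped to a different chromosome.

```python
def mate_on_other_chromosome(sam_line: str) -> bool:
    """True if a SAM record's mate is mapped to a different reference."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname, rnext = fields[2], fields[6]
    paired = bool(flag & 0x1)
    read_unmapped = bool(flag & 0x4)
    mate_unmapped = bool(flag & 0x8)
    if not paired or read_unmapped or mate_unmapped:
        return False
    # RNEXT is "=" when the mate is on the same reference as the read.
    return rnext not in ("=", "*", rname)


def count_interchromosomal(sam_lines) -> tuple:
    """Return (records with mate on another chromosome, total records)."""
    hits = total = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        total += 1
        hits += mate_on_other_chromosome(line)
    return hits, total


# Made-up example records: one same-chromosome pair member, one with
# its mate on chr7.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r2\t65\tchr1\t500\t60\t50M\tchr7\t900\t0\t*\t*",
]
print(count_interchromosomal(example))
```

On real data, the same function could be fed the text output of `samtools view` for the de-duplicated BAM, for example by iterating over `sys.stdin`.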

0

Hello,

How did you trim your reads? During trimming, it can happen that not all reads survive because some end up too short.

fin swimmer

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

I used bbduk from bbtools to trim my reads. Yes, I lost some reads in the process of doing so, but not many.

ADD REPLY • link 6.7 years ago by deepti1rao ▴ 50

score 1 · Answer 1 · 2017-08-21

1

6.7 years ago

GenoMax 141k

Clumpify will allow you to remove all kinds of duplicates; see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates." You also do not need to align the data to remove duplicates.

What kind of an experiment is this?

ADD COMMENT • link 6.7 years ago by GenoMax 141k

0

I'm aware that there are some tools that remove duplicates from reads without the need to align them. But I don't understand how these can tell the difference between a PCR duplicate and a repeat in the genome.

ADD REPLY • link 6.7 years ago by deepti1rao ▴ 50

0

But, I don't understand how these can make out the difference between a PCR duplicate and a repeat in the genome.

If reads are identical over their entire length, there is no need to worry about whether they come from two different repeats in the genome. You can't distinguish them by any means.

ADD REPLY • link 6.7 years ago by GenoMax 141k
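As an illustration of alignment-free duplicate removal, here is a minimal sketch of exact-match de-duplication on read pairs. It is not Clumpify's actual algorithm, only the basic idea that two pairs count as duplicates when both mate sequences are identical over their full length. The function name and the example reads are made up.

```python
def dedupe_exact(pairs):
    """Keep the first occurrence of each (read1, read2) sequence pair."""
    seen = set()
    unique = []
    for name, seq1, seq2 in pairs:
        key = (seq1, seq2)
        if key in seen:
            continue  # both mates identical to an earlier pair
        seen.add(key)
        unique.append((name, seq1, seq2))
    return unique


pairs = [
    ("frag1", "ACGTACGT", "TTGACCAA"),
    ("frag2", "ACGTACGT", "TTGACCAA"),  # identical to frag1, dropped
    ("frag3", "ACGTACGT", "TTGACCAT"),  # differs in read 2, kept
]
print([name for name, _, _ in dedupe_exact(pairs)])
```

A strict scheme like this one would miss duplicates that were trimmed to different lengths, so it is best applied to the untrimmed reads.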

1

The situation you describe would lead to a low mapping quality during the alignment, probably even 0. Good practice is to discard these reads categorically because, as you say, one cannot distinguish them properly. I typically remove duplicates after all the filtering steps. For standard assays like ATAC-seq or ChIP-seq, keep only reads that are properly paired, primary alignments, with a MAPQ >= 30 and a reasonable insert size (usually up to 2000 bp), followed by duplicate removal.

ADD REPLY • link 6.7 years ago by ATpoint 82k
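The filtering criteria above can be sketched on SAM text records as follows. This is only an illustration: the function name is hypothetical, "primary" is interpreted here as neither secondary (0x100) nor supplementary (0x800), and the thresholds are simply the values mentioned in the post.

```python
MIN_MAPQ = 30      # minimum mapping quality from the post above
MAX_INSERT = 2000  # maximum insert size in bp from the post above


def passes_filters(sam_line, min_mapq=MIN_MAPQ, max_insert=MAX_INSERT):
    """True if a SAM record is a properly paired, primary, confident hit."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq, tlen = int(fields[1]), int(fields[4]), int(fields[8])
    proper_pair = bool(flag & 0x2)
    unmapped = bool(flag & 0x4)
    secondary = bool(flag & 0x100)
    supplementary = bool(flag & 0x800)
    primary = not (secondary or supplementary)
    return (proper_pair and not unmapped and primary
            and mapq >= min_mapq
            and 0 < abs(tlen) <= max_insert)


# Made-up example records.
good = "a\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*"
low_mapq = "b\t99\tchr1\t100\t0\t50M\t=\t300\t250\t*\t*"
secondary_hit = "c\t355\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*"
long_insert = "d\t99\tchr1\t100\t60\t50M\t=\t5000\t4950\t*\t*"

for record in (good, low_mapq, secondary_hit, long_insert):
    print(record.split("\t")[0], passes_filters(record))
```

The check is per record. In practice a pair should only be kept when both mates pass, and duplicate removal would then run on the filtered alignments.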

0

I'm worried about bias from duplicates during variant calling.

ADD REPLY • link 6.7 years ago by deepti1rao ▴ 50

score 0 · Answer 2 · 2017-08-21

0

6.7 years ago

vmicrobio ▴ 290

I use prinseq (-derep option) to remove duplicates.

ADD COMMENT • link 6.7 years ago by vmicrobio ▴ 290

0

Thanks! Will try prinseq!

ADD REPLY • link 6.7 years ago by deepti1rao ▴ 50