Question: Removing Duplicates before aligning
0
newbinf0 wrote:

Hi,

I'm trying to call variants from amplicon-based sequencing reads. These reads have primer adapters and barcodes on both ends. I'm looking into the GATK pipeline and the Samtools/VarScan pipeline.

I was able to remove the primer sequences on both sides using cutadapt.

Next, I aligned my reads using BWA-MEM. Then I removed duplicate reads (to remove PCR duplicates) using samtools markdup. However, aligning removed the barcodes on both ends, and deduplicating removed most of my reads. I'm looking into Picard's MarkDuplicates, but that also does not seem applicable to amplicon-based reads, because it is based on the start position of the reads and would delete the majority of my reads.

Is there any way to remove identical sequences for amplicon-based reads? Furthermore, I want the barcode identifiers to remain after aligning. How would I do that?

Thank you!

next-gen dna-seq gatk • 318 views
modified 3 months ago by gb430 • written 3 months ago by newbinf0
1

Do not remove duplicates with amplicon data: your reads are, by definition, all duplicates. Aligning with BWA should not remove the barcodes; they should have been soft-clipped, but they should still be present in the reads.
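To illustrate the soft-clipping point: in a SAM record, soft-clipped bases are marked with S in the CIGAR string but stay in the SEQ field, so they can be recovered after alignment. A minimal Python sketch, assuming a made-up read and CIGAR string (not taken from the poster's data) and a hypothetical helper name `soft_clipped_ends`:

```python
import re

def soft_clipped_ends(cigar, seq):
    """Return (left_clip, right_clip) sequences for a SAM CIGAR/SEQ pair."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    # Hard-clipped bases (H) are absent from SEQ, so ignore those operations.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return seq[:left], seq[len(seq) - right:]

# Hypothetical read: 4-base barcode, 12 aligned bases, 4-base barcode.
cigar = "4S12M4S"
seq = "ACGT" + "TTGACCGATTGC" + "TGCA"
left, right = soft_clipped_ends(cigar, seq)
print(left, right)
```

If the barcodes are really gone from the BAM, it is worth checking whether an earlier trimming step removed them rather than the aligner.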

Is there any way to remove identical sequences for amplicon-based reads? Furthermore, I want the barcode identifiers to remain after aligning. How would I do that?

You want to keep one correct read, and lots of reads with errors? Why remove identical reads?

modified 3 months ago • written 3 months ago by h.mon20k

I want to delete reads with the exact same sequences (including barcodes) so that I can eliminate any PCR duplicates. I want to make sure that my future variant analysis is not biased because of PCR duplicates.

written 3 months ago by newbinf0
1

You have amplicon-based reads. By definition, all reads you see are PCR duplicates.

written 12 weeks ago by WouterDeCoster32k

Hello newbinf,

Don't forget to follow up on your threads.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.


written 12 weeks ago by h.mon20k
5
gb430 wrote:

You should not just delete identical sequences; their abundance is information that other tools can use. Instead, use USEARCH or VSEARCH dereplication.

USEARCH: https://www.drive5.com/usearch/manual/cmd_fastx_uniques.html

VSEARCH:

vsearch --derep_fulllength [input.fa] --output [output.fa] --sizeout --uc uc_out

With those tools you do "remove" the duplicates, but the counts are kept. If a sequence occurs 30 times, it keeps one copy and adds ";size=30" to the FASTA header. The abundance of these duplicates is useful. Because you amplify the DNA, you are supposed to have more than one sequence per amplicon, so reads with an abundance of one are junk most of the time.
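What full-length dereplication does can be sketched in a few lines of Python. This only illustrates the idea and is not how VSEARCH is implemented; the function name `dereplicate` and the toy reads are hypothetical, and the ";size=N" suffix mimics the abundance annotation described above.

```python
from collections import Counter

def dereplicate(records):
    """Collapse identical sequences from a list of (header, sequence) pairs.

    Returns one (header;size=N, sequence) pair per unique sequence,
    most abundant first, labeled with the header of its first occurrence.
    """
    counts = Counter()
    first_header = {}
    for header, seq in records:
        counts[seq] += 1
        first_header.setdefault(seq, header)
    return [(f"{first_header[seq]};size={n}", seq)
            for seq, n in counts.most_common()]

# Toy input: three identical reads and one singleton.
reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "ACGA"), ("r4", "ACGT")]
for header, seq in dereplicate(reads):
    print(f">{header}\n{seq}")
```

The singleton `r3` ends up with `size=1`, which is the kind of read that would be dropped by a minimum-abundance filter.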

In your case specifically, you want to call variants, so I would only use reads that are present at least X times. Say you have sequences A and B, where A has an abundance of 300 and B an abundance of 2. If A and B differ by only one base, B is probably a sequencing error. If instead A=300 and B=300 and they differ by only one base, it is probably a variant.
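The A-versus-B reasoning can be written down as a simple rule of thumb. The sketch below is only an illustration: the function names and the `min_ratio=0.1` cutoff are assumptions chosen for the example, not values recommended by any tool, and a real variant caller uses far more evidence than this.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def classify_pair(seq_a, size_a, seq_b, size_b, min_ratio=0.1):
    """Guess whether a one-base difference is a variant or an error.

    min_ratio is an arbitrary illustrative cutoff on the abundance ratio
    (rarer sequence / more abundant sequence).
    """
    if hamming(seq_a, seq_b) != 1:
        return "not a single-base difference"
    ratio = min(size_a, size_b) / max(size_a, size_b)
    return "likely variant" if ratio >= min_ratio else "likely sequencing error"

# The two scenarios from the text, with made-up sequences.
print(classify_pair("ACGTAC", 300, "ACGTAA", 2))
print(classify_pair("ACGTAC", 300, "ACGTAA", 300))
```

The cutoff would need tuning to the expected allele frequency; a low-frequency somatic variant, for example, could look like an error under a ratio this strict.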

modified 3 months ago • written 3 months ago by gb430

Wow, thanks for your answer. I also have some amplicon-based data, and I skipped the mark-duplicates step because fastp reports a duplication rate of more than 96%. Now I know I should use vsearch to dereplicate instead.

written 12 weeks ago by MatthewP10

Sorry, the comment was meant for newbinf. I have moved it to his post now. But it also serves as a tip for you: if you found an answer helpful and it solved your problem, you may also upvote it.

modified 12 weeks ago • written 12 weeks ago by h.mon20k
1
Bioinformatics Center, Pune
doctor.dee005170 wrote:

If you are familiar with the QIIME pipeline, its workflow includes a clustering step that groups all sequences by sequence similarity and then picks a representative from each cluster. If you want to remove duplicates (i.e. 100% similar sequences), use pick_otus.py with the usearch method and a sequence similarity of 100%. After that, pick representatives using pick_rep_set.py.

Ideally, the final output of pick_rep_set.py should contain reads without duplicates.

Good luck; I have tried this myself.

written 3 months ago by doctor.dee005170