Question: Removing Duplicates before aligning
5 months ago by
newbinf0
newbinf0 wrote:

Hi,

I'm trying to call variants from amplicon-based sequencing reads. These reads have primer adapters and barcodes on both ends. I'm looking into the GATK and Samtools/VarScan pipelines.

I was able to remove the primer sequences on both sides using cutadapt.

Next, I aligned my reads using BWA-MEM. Then I removed duplicate reads (to get rid of PCR duplicates) using samtools markdup. However, aligning removed the barcodes on both ends, and deduplicating removed most of my reads. I'm looking into Picard's MarkDuplicates, but that also does not seem applicable to amplicon-based reads, because it is based on the start position of the reads and would delete the majority of them.

Is there any way to remove identical sequences for amplicon-based reads? Furthermore, I want the barcode identifiers to remain after aligning. How would I do that?

Thank you!


Do not remove duplicates with amplicon data: your reads are, by definition, all duplicates. Aligning with BWA should not remove the barcodes. They should have been soft-clipped, so they should still be present in the aligned reads.
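To check whether the barcodes survived as soft clips, one option is to read the CIGAR string and SEQ field of each alignment record. The sketch below is only an illustration: `soft_clipped_ends` is a hypothetical helper, not part of BWA or samtools, and it assumes you have already pulled the CIGAR and SEQ columns out of the SAM file. Note that for reads aligned to the reverse strand, SEQ is stored reverse-complemented, so the clipped bases would be too.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_ends(cigar, seq):
    """Return (left_clip, right_clip): the bases of SEQ marked as soft-clipped."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are not stored in SEQ, so ignore those operations.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return seq[:left], seq[len(seq) - right:]


# Made-up record: 6 clipped bases, 20 aligned bases, 6 clipped bases.
example_seq = "ACGTAC" + "T" * 20 + "GGCATG"
print(soft_clipped_ends("6S20M6S", example_seq))
```

If both returned strings are empty for most reads, the barcodes were probably trimmed before alignment rather than dropped by the aligner.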

> Is there any way to remove identical sequences for amplicon-based reads? Furthermore, I want the barcode identifiers to remain after aligning. How would I do that?

You want to keep one correct read, and lots of reads with errors? Why remove identical reads?

modified 5 months ago • written 5 months ago by h.mon

I want to delete reads with the exact same sequences (including barcodes) so that I can eliminate any PCR duplicates. I want to make sure that my future variant analysis is not biased because of PCR duplicates.

written 5 months ago by newbinf

You have amplicon-based reads. By definition, all reads you see are PCR duplicates.

written 5 months ago by WouterDeCoster

Hello newbinf,

Don't forget to follow up on your threads.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

written 5 months ago by h.mon
5 months ago by
gb540
gb540 wrote:

You should not simply delete identical sequences; their abundance is information that other tools can use. Use USEARCH or VSEARCH dereplication instead.

USEARCH: https://www.drive5.com/usearch/manual/cmd_fastx_uniques.html

VSEARCH:

vsearch --derep_fulllength [input.fa] --output [output.fa] --sizeout --uc uc_out

With those tools you do "remove" the duplicates. If a sequence has 30 duplicates, the tool keeps one copy and adds ";size=30" to the FASTA header. The abundance of these duplicates is useful. Because you amplify DNA, you are supposed to have more than one sequence per amplicon, so reads with an abundance of one are junk most of the time.
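To make the idea concrete, here is a minimal Python sketch of what full-length dereplication does conceptually. It is not how VSEARCH is implemented. The function name `dereplicate` and the `uniqN` identifiers are made up for illustration (VSEARCH keeps an original read header instead), and comparing sequences case-insensitively is an assumption of this sketch.

```python
from collections import Counter


def dereplicate(sequences):
    """Collapse identical sequences and annotate each unique one with its abundance."""
    # Count exact, full-length matches (case-insensitive in this sketch).
    counts = Counter(seq.upper() for seq in sequences)
    records = []
    # most_common() yields the unique sequences by decreasing abundance.
    for i, (seq, n) in enumerate(counts.most_common(), start=1):
        # Hypothetical identifier plus a ";size=N" abundance annotation.
        records.append((f"uniq{i};size={n}", seq))
    return records


reads = ["ACGT", "acgt", "ACGA", "ACGT"]
for header, seq in dereplicate(reads):
    print(f">{header}\n{seq}")
```

No read information is lost this way: every input read is still accounted for in the size annotation of its unique sequence.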

In your case specifically, you want to check variants, so I would only use sequences that are present at least X times. Say you have sequences A and B, where A has an abundance of 300 and B of 2. If A and B differ by only one base, B is probably a sequencing error. If A and B both have an abundance of 300 and differ by only one base, it is probably a variant.
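The A-versus-B reasoning above could be sketched roughly as follows. This is only an illustration of the heuristic, not a variant caller: `likely_sequencing_errors` is a hypothetical function, and the default `min_ratio=10` is an arbitrary threshold chosen for the example, not a recommended value.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def likely_sequencing_errors(abundances, min_ratio=10):
    """Flag sequences one base away from a much more abundant sequence.

    abundances maps each unique sequence to its read count.
    min_ratio is an assumed cutoff: how many times more abundant the
    neighbour must be before the rarer sequence is treated as an error.
    """
    flagged = set()
    for minor, minor_n in abundances.items():
        for major, major_n in abundances.items():
            if (len(minor) == len(major)
                    and major_n >= min_ratio * minor_n
                    and hamming(minor, major) == 1):
                flagged.add(minor)
                break
    return flagged


# A=300 and B=2 differ by one base: B is flagged as a probable error.
print(likely_sequencing_errors({"ACGTACGT": 300, "ACGTACGA": 2}))
# A=300 and B=300 differ by one base: nothing is flagged (probable variant).
print(likely_sequencing_errors({"ACGTACGT": 300, "ACGTACGA": 300}))
```

The pairwise loop is quadratic in the number of unique sequences, which is fine for a handful of amplicons but would need a smarter approach for large datasets.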


Wow, thanks for your answer. I also have some amplicon-based data, and I skipped duplicate marking because fastp reports a duplication rate of more than 96%. Now I know I should use vsearch to dereplicate instead.

written 5 months ago by MatthewP

Sorry, the comment was meant for newbinf. I have moved it to his post now. But it also serves as a tip for you: if you found an answer helpful and it solved your problem, you may also upvote it.

modified 5 months ago • written 5 months ago by h.mon
5 months ago by
National Centre for Cell Science, Pune
Dattatray Mongad240 wrote:

If you are familiar with the QIIME pipeline, its workflow includes a clustering step that groups all sequences by sequence similarity and then picks a representative from each cluster. If you want to remove duplicates (i.e. 100% identical sequences), use pick_otus.py with the usearch method and a sequence similarity of 100%. After that, pick representatives using pick_rep_set.py.

Ideally, the final output of pick_rep_set.py should contain reads without duplicates.

Good luck; I have tried this.
