Question: Scaleable trimming of GenBank sequences?
gravatar for Philipp Bayer
7 months ago by
Philipp Bayer6.7k
Philipp Bayer6.7k wrote:

I'm trying to automate the building of a reference database for DADA2. As such I'm using esearch to download ~200k fasta sequences for my search term of one gene from GenBank. Many GenBank sequences are joined-up sequences of several genes, and I'd like to trim those unnecessary genes away in order to reduce false positive DADA2 hits later on. I have many known versions of my candidate genes so I can 'just' align my GenBank downloads with those genes, but...

I've thought about using one of my known genes and using Biopython's pairwise alignment functions, but for 200k Genbank sequences that will take forever, and I'll mess it up somehow anyway. Is there any parallelisable tool which 'cleans up' fastas by trimming overhangs using one or several known sequences? My Google-fu is failing me!

alignment • 241 views
ADD COMMENTlink written 7 months ago by Philipp Bayer6.7k

(A few minutes later: Current best pipeline I can find is MAFFT/MUSCLE with gaps turned off, followed by trimAl, in a wrapper script to parallelise? curious to hear more!)

ADD REPLYlink written 7 months ago by Philipp Bayer6.7k
gravatar for michael.ante
7 months ago by
michael.ante3.6k wrote:

Hi Philipp,

This is just an idea: Align the fasta file against your gene as a reference with a tool allowing for soft-clipping. Then parse the bam file utilising the cigar string back into fasta format.

Alternatively, follow your biopython approach but split the multi-fasta into single entry fasta files. Use the script with GNU parallel. AFAIK, the multi-threading option in python is rather useless.



ADD COMMENTlink written 7 months ago by michael.ante3.6k

Amazing idea, I hadn't even thought of soft-clipping! I've currently written a relatively slow script that does what I describe in the OP post, after that I'll give BWA or similar a try, which is probably WAY faster, thanks for the idea!!

Edit: FWIW, here's the simple script I ended up using - ran for three hours. I haven't yet tried soft-clipping alignments yet, that comes next!

Later edit: Wow, this COMPLETELY breaks down if the downloaded sequence is large - Genbank has a few records where 10 genes are sequenced together including my 1 gene, and then the MSA breaks down and you get tiny pieces aligned all over the place, and the trimming doesn't trim much. BWA is probably better for that.

ADD REPLYlink modified 7 months ago • written 7 months ago by Philipp Bayer6.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1756 users visited in the last hour