Question

Scaleable trimming of GenBank sequences?

0

Entering edit mode

4.3 years ago

Philipp Bayer 8.3k

I'm trying to automate the building of a reference database for DADA2. As such I'm using esearch to download ~200k fasta sequences for my search term of one gene from GenBank. Many GenBank sequences are joined-up sequences of several genes, and I'd like to trim those unnecessary genes away in order to reduce false positive DADA2 hits later on. I have many known versions of my candidate genes so I can 'just' align my GenBank downloads with those genes, but...

I've thought about using one of my known genes and using Biopython's pairwise alignment functions, but for 200k Genbank sequences that will take forever, and I'll mess it up somehow anyway. Is there any parallelisable tool which 'cleans up' fastas by trimming overhangs using one or several known sequences? My Google-fu is failing me!

alignment • 967 views

ADD COMMENT • link 4.3 years ago by Philipp Bayer 8.3k

0

Entering edit mode

(A few minutes later: Current best pipeline I can find is MAFFT/MUSCLE with gaps turned off, followed by trimAl, in a wrapper script to parallelise? curious to hear more!)

ADD REPLY • link 4.3 years ago by Philipp Bayer 8.3k

score 2 · Answer 1 · 2019-12-17

2

Entering edit mode

4.3 years ago

michael.ante ★ 3.8k

Hi Philipp,

This is just an idea: Align the fasta file against your gene as a reference with a tool allowing for soft-clipping. Then parse the bam file utilising the cigar string back into fasta format.

Alternatively, follow your biopython approach but split the multi-fasta into single entry fasta files. Use the script with GNU parallel. AFAIK, the multi-threading option in python is rather useless.

Cheers,

Michael

ADD COMMENT • link 4.3 years ago by michael.ante ★ 3.8k

1

Entering edit mode

Amazing idea, I hadn't even thought of soft-clipping! I've currently written a relatively slow script that does what I describe in the OP post, after that I'll give BWA or similar a try, which is probably WAY faster, thanks for the idea!!

Edit: FWIW, here's the simple script I ended up using - ran for three hours. I haven't yet tried soft-clipping alignments yet, that comes next!

Later edit: Wow, this COMPLETELY breaks down if the downloaded sequence is large - Genbank has a few records where 10 genes are sequenced together including my 1 gene, and then the MSA breaks down and you get tiny pieces aligned all over the place, and the trimming doesn't trim much. BWA is probably better for that.

ADD REPLY • link 4.3 years ago by Philipp Bayer 8.3k