Question: Remove sequences with (50% gaps) from MSA
gravatar for nataliagru1
4 months ago by
nataliagru150 wrote:

How do I remove sequences from my MSA that contain 50% gaps? I know there are various posts about removing columns with gaps. But I'm looking for a simple script to identify alignments within my MSA that have >50% gaps "-" and remove them. I am not well versed in BioPython AlignIO. I feel as though this problem is simple. Any guidance is greatly appreciated.

ADD COMMENTlink modified 4 months ago by Mensur Dlakic6.4k • written 4 months ago by nataliagru150

Unless you have a reference sequence, this is not a simple problem. Let's say you have 9 sequences that are around 100 residues each, and one that is homolgous to all of them but has an inserted domain that brings its size to 200 residues. When you align them, those 9 sequences would likely have at least 50% gaps and be removed. A potential workaround would be to first trim the columns that have majority of positions in gaps, and then remove sequences with 50% overall gaps. Everything should be realigned afterwards.

ADD REPLYlink written 4 months ago by Mensur Dlakic6.4k

Thank you for your response. Yes, I first did remove columns that had a majority position in gaps with trimAL. It's the later I am having an issue with. I am uncertain of trimAL parameters to remove undesirable sequences and was, therefore, wondering if there was a simple way to write a script to remove sequence with 50% overall gaps.

ADD REPLYlink written 4 months ago by nataliagru150
gravatar for Mensur Dlakic
4 months ago by
Mensur Dlakic6.4k
Mensur Dlakic6.4k wrote:

See if the script below works for you. The alignment has to be in FASTA format, though it can probably be modified easily to take other formats as well.

python original_aln.fas new_aln.fas 0.5

Gap fraction cutoff must be entered on a 0-1 scale.

import sys
from Bio import SeqIO

FastaFile = open(sys.argv[1], 'r')
FastaDroppedFile = open(sys.argv[2], 'w')
drop_cutoff = float(sys.argv[3])

if (drop_cutoff > 1) or (drop_cutoff < 0):
    print('\n Sequence drop cutoff must be in 0-1 range !\n')

for seqs in SeqIO.parse(FastaFile, 'fasta'):
    name =
    seq = seqs.seq
    seqLen = len(seqs)
    gap_count = 0
    for z in range(seqLen):
        if seq[z]=='-':
            gap_count += 1
    if (gap_count/float(seqLen)) >= drop_cutoff:
        print(' %s was removed.' % name)
        SeqIO.write(seqs, FastaDroppedFile, 'fasta')

ADD COMMENTlink modified 4 months ago • written 4 months ago by Mensur Dlakic6.4k

Thank you very kindly for your response!! Very helpful indeed! I had some computer issues and am now re-installing biopython and other programs! Will test your code and report back once I have everything up and running! Greatly appreciate you taking the time!

ADD REPLYlink written 4 months ago by nataliagru150
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 719 users visited in the last hour