How do I remove sequences from my MSA that contain 50% gaps? I know there are various posts about removing columns with gaps. But I'm looking for a simple script to identify alignments within my MSA that have >50% gaps "-" and remove them. I am not well versed in BioPython AlignIO. I feel as though this problem is simple. Any guidance is greatly appreciated.
See if the script below works for you. The alignment has to be in FASTA format, though it can probably be modified easily to take other formats as well.
python fasta_drop.py original_aln.fas new_aln.fas 0.5
Gap fraction cutoff must be entered on a
import sys from Bio import SeqIO FastaFile = open(sys.argv, 'r') FastaDroppedFile = open(sys.argv, 'w') drop_cutoff = float(sys.argv) if (drop_cutoff > 1) or (drop_cutoff < 0): print('\n Sequence drop cutoff must be in 0-1 range !\n') sys.exit(1) for seqs in SeqIO.parse(FastaFile, 'fasta'): name = seqs.id seq = seqs.seq seqLen = len(seqs) gap_count = 0 for z in range(seqLen): if seq[z]=='-': gap_count += 1 if (gap_count/float(seqLen)) >= drop_cutoff: print(' %s was removed.' % name) else: SeqIO.write(seqs, FastaDroppedFile, 'fasta') FastaFile.close() FastaDroppedFile.close()