You don't seem to have an alignment, but rather a collection of fasta sequences. If I am interpreting that correctly, seqtk will do the trick:
seqtk seq -L XX file.fas > trimmed.fas
XX
in the command above is the cutoff length.
If this is actually a poorly formatted fasta alignment, this script should do the job:
import sys
from Bio import SeqIO
FastaFile = open(sys.argv[1], 'r')
FastaDroppedFile = open(sys.argv[2], 'w')
drop_cutoff = float(sys.argv[3])
if (drop_cutoff > 1) or (drop_cutoff < 0):
print('\n Sequence drop cutoff must be in 0-1 range !\n')
sys.exit(1)
for seqs in SeqIO.parse(FastaFile, 'fasta'):
name = seqs.id
seq = seqs.seq
seqLen = len(seqs)
gap_count = 0
for z in range(seqLen):
if seq[z]=='-':
gap_count += 1
if (gap_count/float(seqLen)) >= drop_cutoff:
print(' %s was removed.' % name)
else:
SeqIO.write(seqs, FastaDroppedFile, 'fasta')
FastaFile.close()
FastaDroppedFile.close()
Save it as fasta_drop.py
and then try:
python fasta_drop.py file.fas trimmed.fas 0.5
This will drop all the sequences that have gaps in >=50% of alignment positions.
Thanks a lot for the suggestions!
Yes, it's not an alignment rather a collection of fasta sequences that I want to align but only after removing shorter sequences. Sorry about the confusion.
I am familiar with seqtk but for that I need to state cutoff length, which varies for different genes. Proportion instead of fixed cutoff length would be better but couldn't locate proportion in the seqtk manual.
I was able to resolve the issue with running your script. The multi-fasta needs to be aligned before running the script to drop the shorter sequences. Thank you for sharing the script. It was very helpful!