Question

Remove sequences with (50% gaps) from MSA

2

Entering edit mode

4.0 years ago

nataliagru1 ▴ 90

How do I remove sequences from my MSA that contain 50% gaps? I know there are various posts about removing columns with gaps. But I'm looking for a simple script to identify alignments within my MSA that have >50% gaps "-" and remove them. I am not well versed in BioPython AlignIO. I feel as though this problem is simple. Any guidance is greatly appreciated.

Multiple-Sequence-Alignment sequence MSA alignment • 3.8k views

ADD COMMENT • link updated 9 months ago by Mensur Dlakic ★ 27k • written 4.0 years ago by nataliagru1 ▴ 90

1

Entering edit mode

Unless you have a reference sequence, this is not a simple problem. Let's say you have 9 sequences that are around 100 residues each, and one that is homolgous to all of them but has an inserted domain that brings its size to 200 residues. When you align them, those 9 sequences would likely have at least 50% gaps and be removed. A potential workaround would be to first trim the columns that have majority of positions in gaps, and then remove sequences with 50% overall gaps. Everything should be realigned afterwards.

ADD REPLY • link 4.0 years ago by Mensur Dlakic ★ 27k

0

Entering edit mode

Thank you for your response. Yes, I first did remove columns that had a majority position in gaps with trimAL. It's the later I am having an issue with. I am uncertain of trimAL parameters to remove undesirable sequences and was, therefore, wondering if there was a simple way to write a script to remove sequence with 50% overall gaps.

ADD REPLY • link 4.0 years ago by nataliagru1 ▴ 90

Ram · Answer 1 · 2020-04-24

6

Entering edit mode

4.0 years ago

Mensur Dlakic ★ 27k

See if the script below works for you. The alignment has to be in FASTA format, though it can probably be modified easily to take other formats as well.

python fasta_drop.py original_aln.fas new_aln.fas 0.5

Gap fraction cutoff must be entered on a 0-1 scale.

import sys
from Bio import SeqIO

FastaFile = open(sys.argv[1], 'r')
FastaDroppedFile = open(sys.argv[2], 'w')
drop_cutoff = float(sys.argv[3])

if (drop_cutoff > 1) or (drop_cutoff < 0):
    print('\n Sequence drop cutoff must be in 0-1 range !\n')
    sys.exit(1)

for seqs in SeqIO.parse(FastaFile, 'fasta'):
    name = seqs.id
    seq = seqs.seq
    seqLen = len(seqs)
    gap_count = 0
    for z in range(seqLen):
        if seq[z]=='-':
            gap_count += 1
    if (gap_count/float(seqLen)) >= drop_cutoff:
        print(' %s was removed.' % name)
    else:
        SeqIO.write(seqs, FastaDroppedFile, 'fasta')

FastaFile.close()
FastaDroppedFile.close()

ADD COMMENT • link updated 9 months ago by Ram 43k • written 4.0 years ago by Mensur Dlakic ★ 27k

0

Entering edit mode

Thank you very kindly for your response!! Very helpful indeed! I had some computer issues and am now re-installing biopython and other programs! Will test your code and report back once I have everything up and running! Greatly appreciate you taking the time!

ADD REPLY • link 4.0 years ago by nataliagru1 ▴ 90

0

Entering edit mode

If I wanted to do something similar but for AA ambiguous codes, how would I change this? For example, I want to run this on thousands of alignments which all have 103 samples. But some samples are represented by >75% of "X"s rather than gaps.

ADD REPLY • link 9 months ago by Jenna • 0

0

Entering edit mode

Instead of counting gaps in the code below, you count all possible ambiguous characters.

for z in range(seqLen):
    if seq[z]=='-':
        gap_count += 1

ADD REPLY • link 9 months ago by Mensur Dlakic ★ 27k

0

Entering edit mode

Would I change the '-' to 'X' ?

for z in range(seqLen):
    if seq[z]=='X':
        gap_count += 1

ADD REPLY • link updated 9 months ago by GenoMax 141k • written 9 months ago by Jenna • 0

0

Entering edit mode

This worked, thank you! Now that outputs the files that were dropped to the command window.

I have a for loop to do this for 8,000 protein alignments (right now pointing to a test folder with 4 alignments):

for f in `ls test/*fasta`
do
        python drop_fasta.py ${f} test-2/`basename ${f}` 0.5 >> Output-test.txt
done

I want to be able to get some stats on the samples that were removed by using this output. I'd like to be able to get a summary output file that counts how many samples were removed from all the alignments, what their names are, and how many times each of those samples were removed from alignments. Then I could just do some simple math and get the total number of alignments each sample is present in.

do you have a suggestion how to do this last step?

ADD REPLY • link 9 months ago by Jenna • 0

0

Entering edit mode

You will need to modify a script on your own to do exactly what you want. It shouldn't be difficult as most of the work is already done. You also may want to redirect the screen output into a file.

ADD REPLY • link 9 months ago by Mensur Dlakic ★ 27k