Question

Remove short sequences with length < 50% from a multi-fasta file

0

2.7 years ago

b.shrestha • 0

Hi All,

I need to remove sequences that are shorter than 50% of the entire alignment in a multi-fasta file. For example, I want to remove the "Seq3" sequence in the example below, as its length is less than half the length of other sequences. I have many multi-fasta files and I don't want to check them all manually. Any suggestions?

[image: example_fasta]

Thank you!

Alignment Fasta Sequence • 2.9k views

ADD COMMENT • link updated 16 months ago by Hugo ▴ 380 • written 2.7 years ago by b.shrestha • 0

2

16 months ago

DareDevil ★ 4.3k

You can also try:

from Bio import SeqIO

def filter_sequences(input_file, output_file):
    # Open the input file in read mode and output file in write mode
    with open(input_file, "r") as handle, open(output_file, "w") as output_handle:
        records = list(SeqIO.parse(handle, "fasta"))
        if not records:
            return  # nothing to filter; max() would fail on an empty file

        # Cutoff: 50% of the longest sequence in the file
        min_length = 0.5 * max(len(record.seq) for record in records)

        # Write only sequences at least as long as the cutoff
        for record in records:
            if len(record.seq) >= min_length:
                SeqIO.write(record, output_handle, "fasta")

# Provide the path to your input and output files
input_file = "input.fasta"
output_file = "output.fasta"

# Call the function to filter sequences
filter_sequences(input_file, output_file)
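
Since the question mentions many multi-fasta files, the same idea can be looped over a folder. The following is only a rough sketch under a few assumptions: it uses a small hand-written FASTA reader instead of Biopython, it counts ungapped length (so that gap-padded sequences in an alignment are still recognised as short), and the names read_fasta, filter_short and filter_directory as well as the "*.fasta" pattern are made up for illustration.

```python
from pathlib import Path


def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_short(records, fraction=0.5):
    """Keep records whose ungapped length is >= fraction * longest ungapped length."""
    if not records:
        return []
    # Assumption: '-' is the only gap character in the files
    lengths = [len(seq.replace("-", "")) for _, seq in records]
    cutoff = fraction * max(lengths)
    return [rec for rec, n in zip(records, lengths) if n >= cutoff]


def filter_directory(in_dir, out_dir, pattern="*.fasta"):
    """Filter every file matching pattern in in_dir and write results to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(in_dir).glob(pattern)):
        kept = filter_short(read_fasta(path))
        with open(out_dir / path.name, "w") as out:
            for header, seq in kept:
                # Sequences are written on a single line, not wrapped
                out.write(f">{header}\n{seq}\n")
```

Calling filter_directory("fasta_in", "fasta_out") (both folder names are placeholders) would write one filtered file per input file, leaving the originals untouched.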

0

16 months ago

Hugo ▴ 380

SEDA (https://www.sing-group.org/seda/) includes a "Filtering" operation (https://www.sing-group.org/seda/manual/operations.html#id4) with a "Remove by size difference" option, which lets you specify the desired percentage and the reference sequence to be used in the comparison.

0

16 months ago

benformatics 4.0k

R version:

library(Biostrings)
## read your fasta (DNA; use readAAStringSet() for protein)
fa <- readDNAStringSet('your.fasta')
## cutoff should be 50% of the full alignment length; a fixed value of 50 nt is used here
## (0.5 * max(width(fa)) would derive it from the longest sequence instead)
removal.cutoff <- 50
## keep only sequences of at least removal.cutoff nt
fa <- fa[width(fa) >= removal.cutoff]
## write a new file with your subset
writeXStringSet(fa, 'your_new.fasta', format='fasta')

0

Hi,

I am trying to build a maximum likelihood phylogenetic tree in R. I have never done this before and need a step-by-step guide on how to do this in R.

I have my files in fasta format in a folder but don't know how to move forward.

Any suggestions will be appreciated, especially on which packages I need to install and how to read multiple fasta files into R to build my tree and obtain a newick output file.

Thanks

ADD REPLY • link 16 months ago by maworh • 0

0

Try FastTree using this Docker image: https://hub.docker.com/r/pegi3s/fasttree/

ADD REPLY • link 16 months ago by Hugo ▴ 380

score 3 · Accepted Answer · 2022-03-01

You don't seem to have an alignment, but rather a collection of fasta sequences. If I am interpreting that correctly, seqtk will do the trick:

seqtk seq -L XX file.fas > trimmed.fas

XX in the command above is the cutoff length: sequences shorter than XX are dropped.
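
If the cutoff should be 50% of the longest sequence rather than a number picked by hand, XX can be computed first. This is just a sketch with a minimal hand-written reader; the function names longest_sequence_length and half_length_cutoff are invented here, and rounding up is a choice made so that only sequences at or above the requested fraction survive.

```python
import math


def longest_sequence_length(path):
    """Return the length of the longest sequence in a FASTA file."""
    longest, current = 0, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: close out the previous one
                longest = max(longest, current)
                current = 0
            else:
                current += len(line)
    return max(longest, current)


def half_length_cutoff(path, fraction=0.5):
    """Smallest integer length that is at least fraction * longest length."""
    return math.ceil(fraction * longest_sequence_length(path))
```

The value returned by half_length_cutoff("file.fas") can then be used in place of XX in the seqtk command.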

If this is actually a poorly formatted fasta alignment, this script should do the job:

import sys
from Bio import SeqIO

if len(sys.argv) != 4:
    sys.exit('Usage: python fasta_drop.py <input.fas> <output.fas> <cutoff 0-1>')

drop_cutoff = float(sys.argv[3])
if (drop_cutoff > 1) or (drop_cutoff < 0):
    print('\n Sequence drop cutoff must be in 0-1 range !\n')
    sys.exit(1)

# Context managers close both files even if parsing fails
with open(sys.argv[1], 'r') as FastaFile, open(sys.argv[2], 'w') as FastaDroppedFile:
    for seqs in SeqIO.parse(FastaFile, 'fasta'):
        name = seqs.id
        seqLen = len(seqs)
        # Count gap characters in the aligned sequence
        gap_count = str(seqs.seq).count('-')
        # Empty records are dropped too (avoids division by zero)
        if seqLen == 0 or (gap_count / float(seqLen)) >= drop_cutoff:
            print(' %s was removed.' % name)
        else:
            SeqIO.write(seqs, FastaDroppedFile, 'fasta')

Save it as fasta_drop.py and then try:

python fasta_drop.py file.fas trimmed.fas 0.5

This will drop all the sequences that have gaps in >=50% of alignment positions.
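
To see how the cutoff behaves at the boundary, here is a toy illustration of the same gap-fraction test on three made-up aligned sequences. The helper names gap_fraction and is_dropped are hypothetical, and treating an empty record as all gaps is an assumption of this sketch.

```python
def gap_fraction(aligned_seq, gap_char="-"):
    """Fraction of alignment columns that are gaps in one aligned sequence."""
    if not aligned_seq:
        return 1.0  # assumption: treat an empty record as all gaps
    return aligned_seq.count(gap_char) / len(aligned_seq)


def is_dropped(aligned_seq, drop_cutoff=0.5):
    """Same comparison as the script: drop when the gap fraction reaches the cutoff."""
    return gap_fraction(aligned_seq) >= drop_cutoff


# Made-up alignment of length 10, for illustration only
alignment = {
    "Seq1": "ACGTACGTAC",
    "Seq2": "ACGTA-----",
    "Seq3": "ACG-------",
}
for name, seq in alignment.items():
    print(name, gap_fraction(seq), is_dropped(seq))
```

Because the comparison is >=, Seq2, which is exactly 50% gaps, is dropped along with Seq3. Pass a slightly higher cutoff if sequences sitting right at the threshold should be kept.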