Question

Multiple sequence alignment clean-up

0

Entering edit mode

4.3 years ago

MSRS ▴ 590

I want to clean my multiple sequence alignment file (FASTA) in which lots of ambiguity like N, gap Y, K are present. From where I want to remove all sequences containing at least 1 ambiguous character. I have tried this code (https://biopython.org/wiki/Sequence_Cleaner), But It also did duplicate removal at the same time, But I do not want to remove duplicate. How can I modify this code?

Thanks in Advance

    #!/usr/bin/env python
import sys
from Bio import SeqIO

def sequence_cleaner(fasta_file, min_length=0, por_n=100):
    # Create our hash table to add the sequences
    sequences={}

    # Using the Biopython fasta parse we can read our fasta input
    for seq_record in SeqIO.parse(fasta_file, "fasta"):
        # Take the current sequence
        sequence = str(seq_record.seq).upper()
        # Check if the current sequence is according to the user parameters
        if (len(sequence) >= min_length and
            (float(sequence.count("N"))/float(len(sequence)))*100 <= por_n):
        # If the sequence passed in the test "is it clean?" and it isn't in the
        # hash table, the sequence and its id are going to be in the hash
            if sequence not in sequences:
                sequences[sequence] = seq_record.id
       # If it is already in the hash table, we're just gonna concatenate the ID
       # of the current sequence to another one that is already in the hash table
            else:
                sequences[sequence] += "_" + seq_record.id


    # Write the clean sequences

    # Create a file in the same directory where you ran this script
    with open("clear_" + fasta_file, "w+") as output_file:
        # Just read the hash table and write on the file as a fasta format
        for sequence in sequences:
            output_file.write(">" + sequences[sequence] + "\n" + sequence + "\n")

    print("CLEAN!!!\nPlease check clear_" + fasta_file)


userParameters = sys.argv[1:]

try:
    if len(userParameters) == 1:
        sequence_cleaner(userParameters[0])
    elif len(userParameters) == 2:
        sequence_cleaner(userParameters[0], float(userParameters[1]))
    elif len(userParameters) == 3:
        sequence_cleaner(userParameters[0], float(userParameters[1]),
                         float(userParameters[2]))
    else:
        print("There is a problem!")
except:
    print("There is a problem!")

alignment • 1.7k views

ADD COMMENT • link updated 4.3 years ago by onestop_data ▴ 330 • written 4.3 years ago by MSRS ▴ 590

score 2 · Accepted Answer · 2020-04-21

2

Entering edit mode

4.3 years ago

onestop_data ▴ 330

Hi - this script was re-written using pysam and should do what you want. Just make sure you pass the flag --keep_all_duplicates

https://onestopdataanalysis.com/clean-fasta-fastq-file/

ADD COMMENT • link 4.3 years ago by onestop_data ▴ 330

0

Entering edit mode

Thank You Sir, It works, But In my sequences, there are gaps, Are there any way to remove the gap?

ADD REPLY • link 4.3 years ago by MSRS ▴ 590

1

Entering edit mode

Try running it with the flag --remove_ambiguous

ADD REPLY • link 4.3 years ago by onestop_data ▴ 330

score 0 · Accepted Answer · 2020-04-21

Hi, try with the following way it should keep the duplicates.

import sys
from Bio import SeqIO


def sequence_cleaner(fasta_file, min_length=0, por_n=100):
    # Create array to add the sequences
    sequences = []

    # Using the Biopython fasta parse we can read our fasta input
    for seq_record in SeqIO.parse(fasta_file, "fasta"):
        # Take the current sequence
        sequence = str(seq_record.seq).upper()
        # Check if the current sequence is according to the user parameters
        if (len(sequence) >= min_length and
                (float(sequence.count("N"))/float(len(sequence)))*100 <= por_n):
            # If the sequence passed in the test "is it clean?" add to array
            sequences.append(">"+seq_record.id)
            sequences.append(sequence)

    # Write the clean sequences

    # Create a file in the same directory where you ran this script
    with open("clear_" + fasta_file, "w+") as output_file:
        # Just read the array and write to file
        for sequence in sequences:
            output_file.write(sequence + "\n")

    print("CLEAN!!!\nPlease check clear_" + fasta_file)