Question: Multiple sequence alignment clean-up
0
gravatar for Shaminur
6 months ago by
Shaminur60
Dhaka University
Shaminur60 wrote:

I want to clean my multiple sequence alignment file (FASTA) in which lots of ambiguity like N, gap Y, K are present. From where I want to remove all sequences containing at least 1 ambiguous character. I have tried this code (https://biopython.org/wiki/Sequence_Cleaner), But It also did duplicate removal at the same time, But I do not want to remove duplicate. How can I modify this code?

Thanks in Advance

    #!/usr/bin/env python
import sys
from Bio import SeqIO

def sequence_cleaner(fasta_file, min_length=0, por_n=100):
    # Create our hash table to add the sequences
    sequences={}

    # Using the Biopython fasta parse we can read our fasta input
    for seq_record in SeqIO.parse(fasta_file, "fasta"):
        # Take the current sequence
        sequence = str(seq_record.seq).upper()
        # Check if the current sequence is according to the user parameters
        if (len(sequence) >= min_length and
            (float(sequence.count("N"))/float(len(sequence)))*100 <= por_n):
        # If the sequence passed in the test "is it clean?" and it isn't in the
        # hash table, the sequence and its id are going to be in the hash
            if sequence not in sequences:
                sequences[sequence] = seq_record.id
       # If it is already in the hash table, we're just gonna concatenate the ID
       # of the current sequence to another one that is already in the hash table
            else:
                sequences[sequence] += "_" + seq_record.id


    # Write the clean sequences

    # Create a file in the same directory where you ran this script
    with open("clear_" + fasta_file, "w+") as output_file:
        # Just read the hash table and write on the file as a fasta format
        for sequence in sequences:
            output_file.write(">" + sequences[sequence] + "\n" + sequence + "\n")

    print("CLEAN!!!\nPlease check clear_" + fasta_file)


userParameters = sys.argv[1:]

try:
    if len(userParameters) == 1:
        sequence_cleaner(userParameters[0])
    elif len(userParameters) == 2:
        sequence_cleaner(userParameters[0], float(userParameters[1]))
    elif len(userParameters) == 3:
        sequence_cleaner(userParameters[0], float(userParameters[1]),
                         float(userParameters[2]))
    else:
        print("There is a problem!")
except:
    print("There is a problem!")
alignment • 280 views
ADD COMMENTlink modified 6 months ago by onestop_data260 • written 6 months ago by Shaminur60
3
gravatar for onestop_data
6 months ago by
onestop_data260
onestop_data260 wrote:

Hi - this script was re-written using pysam and should do what you want. Just make sure you pass the flag --keep_all_duplicates

https://onestopdataanalysis.com/clean-fasta-fastq-file/

ADD COMMENTlink modified 6 months ago • written 6 months ago by onestop_data260

Thank You Sir, It works, But In my sequences, there are gaps, Are there any way to remove the gap?

ADD REPLYlink written 6 months ago by Shaminur60
1

Try running it with the flag --remove_ambiguous

ADD REPLYlink written 6 months ago by onestop_data260
1
gravatar for tamerg
6 months ago by
tamerg90
tamerg90 wrote:

Hi, try with the following way it should keep the duplicates.

import sys
from Bio import SeqIO


def sequence_cleaner(fasta_file, min_length=0, por_n=100):
    # Create array to add the sequences
    sequences = []

    # Using the Biopython fasta parse we can read our fasta input
    for seq_record in SeqIO.parse(fasta_file, "fasta"):
        # Take the current sequence
        sequence = str(seq_record.seq).upper()
        # Check if the current sequence is according to the user parameters
        if (len(sequence) >= min_length and
                (float(sequence.count("N"))/float(len(sequence)))*100 <= por_n):
            # If the sequence passed in the test "is it clean?" add to array
            sequences.append(">"+seq_record.id)
            sequences.append(sequence)

    # Write the clean sequences

    # Create a file in the same directory where you ran this script
    with open("clear_" + fasta_file, "w+") as output_file:
        # Just read the array and write to file
        for sequence in sequences:
            output_file.write(sequence + "\n")

    print("CLEAN!!!\nPlease check clear_" + fasta_file)
ADD COMMENTlink written 6 months ago by tamerg90

Thank you very much, Tamerg, But please, can you have a look more carefully as it has probably some problem.

ADD REPLYlink written 6 months ago by Shaminur60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1093 users visited in the last hour