Optimize a script that extract features from Fasta file using biopython
2
0
Entering edit mode
2.1 years ago
Michal Nevo ▴ 130

Hey,

I have a script that extract features from a large fasta file (1767 MB) using biopython.

I am sending it as a bash job via ssh remote server. The job is running for two days now.. Is there a way to optimize my script? I think maybe the problem is in genome_file.seek(0)

import sys
from Bio import SeqIO
import pandas as pd
from Bio.SeqFeature import SeqFeature, FeatureLocation


class target_features():

    def __init__(self, table_file, male_ref_genome, output_file_name):
        self.features_table = df_from_features_table(table_file)
        self.genome_file = male_ref_genome
        self.output = output_file_name

    def extract_target(self):
        """Grab ten first mRNA features from the features_df
        get the matching chromosome seq from the genome file
        create a feature object and write it to an output fasta file. """
        out = open(self.output, "a")
        genome_file = open(self.genome_file)
        genome_sequence = SeqIO.parse(genome_file, 'fasta')
        #mrna_counter = 0
        for index, row in self.features_table.iterrows():
            #if mrna_counter >= 11:
            #    break
            if row['# feature'] == 'mRNA':
                #mrna_counter = mrna_counter + 1
                chr = row['genomic_accession']
                start = row['start']
                stop = row['end']
                strand = row['strand']
                product_accession = row['product_accession']
                related_accession = row['related_accession']
                name = row['name']
                assert strand == '+' or strand == '-', "strand value incorrect {}".format(strand)
                strand = 1 if strand == '+' else -1
                target_feature = SeqFeature(FeatureLocation(start, stop, strand=strand))
                # find chromosome seq by id
                for record in genome_sequence:
                    if record.id == chr:
                        target_feature.location.extract(record.seq)
                        line = '>{}|{}|{}|{}|{}|{}|{}|{}\n{}\n'.format(record.id, start, stop, strand, product_accession, related_accession, name, stop-start+1, target_feature.location.extract(record.seq))
                        out.write(line)
                        break
                genome_file.seek(0)
                genome_sequence = SeqIO.parse(genome_file, 'fasta')



def df_from_features_table(table_file):
    features_df = pd.read_csv(table_file, sep="\t")
    return features_df



if __name__ == '__main__':

    genome = sys.argv[-3]
    features_data_file = sys.argv[-2]
    out = sys.argv[-1]
    print('genome: ', genome)
    print('feature table: ', features_data_file)
    print('out: ', out+'\n')

    target_features = target_features(features_data_file, genome, out)
    print('AFTER CREATE target_features')
    target_features.extract_target()
    print('AFTER extract target')
features biopython Optimize • 1.4k views
ADD COMMENT
0
Entering edit mode

Providing a small example of the input and the expected output usually helps to get better answers. Having said this and without knowing what the script does specifically, reading a fasta file (especially if it is big) multiple times is subptimal, genome_sequence = SeqIO.parse(genome_file, 'fasta'). Maybe better to give to the main function the genome_sequence variable directly.

ADD REPLY
1
Entering edit mode
2.1 years ago

Hi, seek on the file is certainly not the bottleneck here.

What you are doing is:

  1. iterating over potentially unordered rows
  2. parsing large file (genomes) for every iteration

What you should do (assuming that 1.7gb can fit into memory).

  1. read all your genome seqs to dict and access it with seq.id key (Iterate over seqs once, store for rapid access.)

  2. now iterate over features (pandas table) to extract whatever you want

If you are memory limited:

  1. Iterate over the genome sequences (one at a time)
  2. and extract all the features (from the pandas table) for this genome during this iteration (select your 10 features that you want for this genome first by slicing and do not iterate over the whole pandas table)
ADD COMMENT
0
Entering edit mode

Thank you, Instead of going over this huge Fasta file for each feature from the table I really need to go through the records and the table at the same time in a serial manner.

ADD REPLY
0
Entering edit mode
2.1 years ago
Michal Nevo ▴ 130

I changed my strategy based on the fact that the table contains the chromosomes in a serial way.

I iterate over the genome sequences one record at a time with next(genome_iterator).

Then, I extract all mrna features from the pandas table for the current record. When the genomic_accession is changing to the next record.id, I am getting the next record from the Genome Fasta file and so on..

class TargetFeatures:

    def __init__(self, table_file, male_ref_genome, output_file_name):
        self.features_table = df_from_features_table(table_file)
        self.genome_file = male_ref_genome
        self.output = output_file_name

    def extract_target(self):
        """Iterate the genome records
        for each record, grab a mRNA feature from the features table data frame,
        create a feature object and write it to an output fasta file. """
        out = open(self.output, "a")
        genome_file = open(self.genome_file)
        genome_sequence_iterator = SeqIO.parse(genome_file, 'fasta')
        curr_record = next(genome_sequence_iterator)
        print('FIRST RECORD: {}'.format(curr_record.id))
        #  find current record mRNA features - serial traversal on features for each record:
        for index,row in self.features_table.iterrows():
            if row['# feature'] == 'mRNA':
                chr = row['genomic_accession']
                if curr_record.id != chr:
                    curr_record = next(genome_sequence_iterator)
                    print('NEXT RECORD: {}'.format(curr_record.id))
                # check its the right record on features table:
                if curr_record.id == chr:
                    start = row['start']
                    stop = row['end']
                    strand = row['strand']
                    product_accession = row['product_accession']
                    related_accession = row['related_accession']
                    name = row['name']
                    assert strand == '+' or strand == '-', "strand value incorrect {}".format(strand)
                    strand = 1 if strand == '+' else -1
                    target_feature = SeqFeature(FeatureLocation(start, stop, strand=strand))
                    feature = target_feature.location.extract(curr_record.seq)

                    line = '>{}|{}|{}|{}|{}|{}|{}|{}\n{}\n'.format(curr_record.id, start, stop, strand, product_accession, related_accession, name, stop-start+1, feature)
                    out.write(line)
ADD COMMENT
1
Entering edit mode

What about this (should be faster and more readable), but untested, as I don't have test data...

def extract_target(self):
    """Iterate the genome records
    for each record, grab a mRNA feature from the features table data frame,
    create a feature object and write it to an output fasta file. """

    out = open(self.output, "a")
    genome_file = open(self.genome_file)
    genome_sequence_iterator = SeqIO.parse(genome_file, 'fasta')

    for curr_record in genome_sequence_iterator:
        print('FIRST RECORD: {}'.format(curr_record.id))

        # select only the rows for the genome
        # you can add this to select only the features you are interested in
        # (select only rows for the current record) AND (select only mRNAs)
        ft = self.features_table
        subset = ft.loc[(ft.genomic_accession == curr_record.id) & (ft['# feature'] == 'mRNA'), :]

        for row in subset.itertuples():  # itertuples is MUCH faster then iterrows, but doesn't handle all possible column names (pandas named tuple object)
            chr = row.genomic_accession
            start = row.start
            stop = row.end
            strand = row.strand
            product_accession = row.product_accession
            related_accession = row.related_accession
            name = row.name

            assert strand == '+' or strand == '-', "strand value incorrect {}".format(strand)
            strand = 1 if strand == '+' else -1
            target_feature = SeqFeature(FeatureLocation(start, stop, strand=strand))
            feature = target_feature.location.extract(curr_record.seq)

            line = '>{}|{}|{}|{}|{}|{}|{}|{}\n{}\n'.format(curr_record.id, start, stop, strand, product_accession, related_accession, name, stop-start+1, feature)
            out.write(line)

    genome_file.close() # don't forget this...
ADD REPLY
0
Entering edit mode

Thank you, This is indeed a much more beautiful code.

Do you have any idea why the file I got following your script is 5MB bigger than the file that came out with my script?

ADD REPLY
0
Entering edit mode

This is where I got the features table and the fasta file : GCF_010645085.1_ASM1064508v1_feature_table.txt.gz

GCF_010645085.1_ASM1064508v1_genomic.fna.gz

ADD REPLY
0
Entering edit mode

Your code assumes that the order of the sequence IDs is same in the feature table as is in the fasta file, which might not be true.

# especially this part of code
if curr_record.id != chr:
  curr_record = next(genome_sequence_iterator)

You can skip sequences if the order is not the same.

ADD REPLY
0
Entering edit mode

Thank you so much:)

ADD REPLY

Login before adding your answer.

Traffic: 1966 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6