Best way to processing huge fasta files using python.
1
0
Entering edit mode
7.0 years ago
elisheva ▴ 120

Hello everyone.

I have a vary basic question but I am kind of loose in it.
I wrote a python script to do some calculations on a fasta file with multiple sequences.
when I try to do it my computer hangs.
Is there is any way to read the sequences so it will work properly?
Reading specific number of characters each time won't help - because I need the whole sequence at once.
The goal is to read all cDNAs of all genomes.

For example my file looks like this: (This is just a few sequences)

>CCE57618 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:166:1143:1 gene:HUS2011_pII0001 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:repA description:replication initiation protein RepFIB
GTGGATAAGTCGTCCGGTGAGCTGGTGACACTGACACCAAACAATAACAACACCGTACAA
CCTGTGGCGCTGATGCGTCTGGGCGTCTTTGTACCGACCCTTAAGTCACTGAAGAACAGT
AAAAAAAATACACTGTCACGCACTGATGCCACGGAAGAACTGACGCGTCTTTCCCTGGCC
CGTGCAGAAGGATTCGATAAGGTTGAGATCACCGGTCCCCGGCTGGATATGGATAACGAT
TTCAAGACCTGGGTGGGGATCATTCATTCCTTTGCCCGCCATAACGTGACTGGTGACAAA
GTTGAACTGCCTTTTGTCGAGTTTGCAAAACTGTGTGGTATACCTTCAAGCCAGTCATCA
CGCAGGCTGCGTGAGCGCATCAGCCCTTCCCTGAAACGCATTGCCGGTACCGTGATCTCC
TTTTCCCGCACCGATGAGAAGCACACCCGGGAATATATCACCCATCTGGTACAGTCAGCC
TACTACGATACTGAACGGGATATTGTTCAGTTACAGGCTGATCCCCGCCTGTTTGAACTG
TACCAGTTTGACAGAAAGGTCCTTCTTCAGCTTAAGGCGATTAATGCCCTGAAGCGACGG
GAGTCCGCCCAGGCACTTTACACCTTTATAGAGAGCCTGCCCCGGGATCCGGCACCGATA
TCGCTGGCGCGGCTACGTGCCCGCCTCAATCTGAAGTCTCCGGTATTTTCCCAGAACCAG
ACGGTCAGACGGGCAATGGAGCAGTTACGTGAGATTGGATATCTTGATTACACGGAGATC
CAGCGGGGGCGAACAAAACTCTTCTGTATTCACTACCGACGTCCCCGGTTAAAAGCGCCG
AATGATGAGAGTAAGGAAAATCCGTTGCCACCTTCACCTGTGGAAAAAGTCAGTCCGGAG
ATGGCGGAGAAACTTGCCCTGCTTGAAAAACTGGGCATCACGCTGGATGACCTGGAAAAA
CTCTTCAAATCCCGCTGA
>CCE57619 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:1219:1338:-1 gene:HUS2011_pII0002 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
ATGTTTTATGAAGGGAGCAATGCCTCAGCATCAGGTTACGGGGTGACTCACGTAAGGGAC
AGGCAGATGGCAGCTCAGCCACAGGCAGCACTGCAGGAAACTGAATATAAACTGCAGTGA
>CCE57620 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:1422:2162:-1 gene:HUS2011_pII0003 gene_biotype:protein_coding transcript_biotype:protein_coding description:site-specific recombinase
ATGAACAATGTCATTCCCCTGCAGAATTCACCAGAACGCGTCTCCCTGTTACCCATTGCG
CCGGGGGTGGATTTTGCAACAGCGCTCTCCCTGAGAAGAATGGCCACTTCCACGGGGGCC
ACACCGGCCTACCTGCTGGCCCCGGAAGTGAGTGCCCTTCTTTTCTATATGCCGGATCAG
CGTCACCATATGCTGTTCGCCACCCTCTGGAATACCGGAATGCGTATTGGCGAAGCCCGG
ATGCTGACACCGGAATCATTTGACCTGGATGGAGTAAGACCGTTTGTGCGGATCCAGTCC
GAAAAAGTGCGTGCGCGACGCGGACGCCCGCCAAAAGATGAAGTGCGCCTGGTTCCGCTG
ACAGATATAAGCTATGTCAGGCAGATGGAAAGCTGGATGATCACCACCCGGCCCCGTCGT
CGTGAACCATTATGGGCCGTGACCGACGAAACCATGCGCAACTGGCTGAAGCAGGCTGTC
AGACGGGCCGAAGCTGACGGGGTACACTTTTCGATTCCGGTAACACCACACACTTTCCGG
CACAGCTATATCATGCACATGCTCTATCACCGCCAGCCCCGGAAAGTCATCCAGGCACTG
GCTGGTCACAGGGATCCACGTTCGATGGAGGTCTATACACGGGTGTTTGCGCTGGATATG
GCTGCCACGCTGGCAGTGCCTTTCACAGGTGACGGACGGGATGCTGCAGAGATCCTGCGT
ACACTGCCTCCCCTGAAGTAA

And here what I wrote so far:

name = raw_input("file name: ") #Inputs the file's name
s = ""
ID = ""
with open(name,'r') as cDNAs:
    for line in cDNAs:
        if line.startswith('>'): #If it's a header:
            if(ID != ""): #If there is a sequence.
            print ID + "\n" + s + 2 * "\n"
                cDNA(ID,s)
            ID = line[1: line.find(' ')]
            s = "" #Initializes the sequence.
        else:
            s += line #The sequence itself.
            s = s.replace("\n", "") #Removes all the end of lines.

print ID + "\n" + s + 2 * "\n"
cDNA(ID,s)

Thanks!!!

python fasta • 13k views
ADD COMMENT
1
Entering edit mode

How big the fasta file is and what kind of computation you need?

ADD REPLY
3
Entering edit mode
7.0 years ago
unksci ▴ 180

When processing large files with many FASTA sequences one potential reason for hanging the computer is a lack of sufficient memory.

While such a scenario can be avoided in multiple ways, a particularly convenient option is to use BioPython and its parser, which allows you to process individual entries sequentially

# Install the biopython library (if not already installed)
!pip install biopython

# Import parts of Biopython
from Bio import SeqIO


# File path to your FASTA file
path_to_file = 'example.fasta'   # <--- substitute by your local path

# Open file with "with" statement to avoid problems with access 
# to original file (in case computer hangs
# or there will be any other problem)
with open(path_to_file, mode='r') as handle:

    # Use Biopython's parse function to process individual
    # FASTA records (thus reducing memory footprint)
    for record in SeqIO.parse(handle, 'fasta'):

        # Extract individual parts of the FASTA record
        identifier = record.id
        description = record.description
        sequence = record.seq

        # Example: adapt to extract features you are interested in
        print('----------------------------------------------------------')
        print('Processing the record {}:'.format(identifier))
        print('Its description is: \n{}'.format(description))
        amount_of_nucleotides = len(sequence)
        print('Its sequence contains {} nucleotides.'.format(amount_of_nucleotides))
ADD COMMENT
1
Entering edit mode

Whereas you are right about processing one entry at a time to avoid excessive memory usage when reading very large FASTA files, my experience is that BioPython is quite slow at parsing FASTA files. I would thus not use it if speed is a major concern.

ADD REPLY
0
Entering edit mode

And if the speed indeed important to me? At least I want the Algorithm to run on all genomes. Is there a better way for this? (even not on python...)

ADD REPLY
0
Entering edit mode

The right way to go depends a lot on what you want to do. Reading the FASTA file is clearly not the end goal here - I assume you want to do something with the sequences. My suggestion would be that you post a new question, in which you explain what the actual problem you are trying to solve is. That would make it so much easier for us to give you a good answer.

ADD REPLY
1
Entering edit mode

O.K. I will do it. tnx.

ADD REPLY
1
Entering edit mode

You dont need with open() before SeqIO parse I believe. SeqIO should handle the opening of the file if its passed a filepath as a variable. Certainly this is true of SeqIO.read and SeqIO.convert so I believe will be the case for parse too.

ADD REPLY
0
Entering edit mode

You are right indeed.

ADD REPLY
0
Entering edit mode

While "with" is not strictly required for SeqIO to parse, it's usage is motivated by python's error handling and the unlocking /closing of the fasta file in the preanticipation of a possible and vaguely defined "hanging of the computer" (OP) (as that hanging could also be because of OP's code that processes individual records)

ADD REPLY

Login before adding your answer.

Traffic: 2418 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6