Question: Best way to processing huge fasta files using python.
0
gravatar for elisheva
21 months ago by
elisheva70
Israel
elisheva70 wrote:

Hello everyone.

I have a vary basic question but I am kind of loose in it.
I wrote a python script to do some calculations on a fasta file with multiple sequences.
when I try to do it my computer hangs.
Is there is any way to read the sequences so it will work properly?
Reading specific number of characters each time won't help - because I need the whole sequence at once.
The goal is to read all cDNAs of all genomes.

For example my file looks like this: (This is just a few sequences)

>CCE57618 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:166:1143:1 gene:HUS2011_pII0001 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:repA description:replication initiation protein RepFIB
GTGGATAAGTCGTCCGGTGAGCTGGTGACACTGACACCAAACAATAACAACACCGTACAA
CCTGTGGCGCTGATGCGTCTGGGCGTCTTTGTACCGACCCTTAAGTCACTGAAGAACAGT
AAAAAAAATACACTGTCACGCACTGATGCCACGGAAGAACTGACGCGTCTTTCCCTGGCC
CGTGCAGAAGGATTCGATAAGGTTGAGATCACCGGTCCCCGGCTGGATATGGATAACGAT
TTCAAGACCTGGGTGGGGATCATTCATTCCTTTGCCCGCCATAACGTGACTGGTGACAAA
GTTGAACTGCCTTTTGTCGAGTTTGCAAAACTGTGTGGTATACCTTCAAGCCAGTCATCA
CGCAGGCTGCGTGAGCGCATCAGCCCTTCCCTGAAACGCATTGCCGGTACCGTGATCTCC
TTTTCCCGCACCGATGAGAAGCACACCCGGGAATATATCACCCATCTGGTACAGTCAGCC
TACTACGATACTGAACGGGATATTGTTCAGTTACAGGCTGATCCCCGCCTGTTTGAACTG
TACCAGTTTGACAGAAAGGTCCTTCTTCAGCTTAAGGCGATTAATGCCCTGAAGCGACGG
GAGTCCGCCCAGGCACTTTACACCTTTATAGAGAGCCTGCCCCGGGATCCGGCACCGATA
TCGCTGGCGCGGCTACGTGCCCGCCTCAATCTGAAGTCTCCGGTATTTTCCCAGAACCAG
ACGGTCAGACGGGCAATGGAGCAGTTACGTGAGATTGGATATCTTGATTACACGGAGATC
CAGCGGGGGCGAACAAAACTCTTCTGTATTCACTACCGACGTCCCCGGTTAAAAGCGCCG
AATGATGAGAGTAAGGAAAATCCGTTGCCACCTTCACCTGTGGAAAAAGTCAGTCCGGAG
ATGGCGGAGAAACTTGCCCTGCTTGAAAAACTGGGCATCACGCTGGATGACCTGGAAAAA
CTCTTCAAATCCCGCTGA
>CCE57619 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:1219:1338:-1 gene:HUS2011_pII0002 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
ATGTTTTATGAAGGGAGCAATGCCTCAGCATCAGGTTACGGGGTGACTCACGTAAGGGAC
AGGCAGATGGCAGCTCAGCCACAGGCAGCACTGCAGGAAACTGAATATAAACTGCAGTGA
>CCE57620 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:1422:2162:-1 gene:HUS2011_pII0003 gene_biotype:protein_coding transcript_biotype:protein_coding description:site-specific recombinase
ATGAACAATGTCATTCCCCTGCAGAATTCACCAGAACGCGTCTCCCTGTTACCCATTGCG
CCGGGGGTGGATTTTGCAACAGCGCTCTCCCTGAGAAGAATGGCCACTTCCACGGGGGCC
ACACCGGCCTACCTGCTGGCCCCGGAAGTGAGTGCCCTTCTTTTCTATATGCCGGATCAG
CGTCACCATATGCTGTTCGCCACCCTCTGGAATACCGGAATGCGTATTGGCGAAGCCCGG
ATGCTGACACCGGAATCATTTGACCTGGATGGAGTAAGACCGTTTGTGCGGATCCAGTCC
GAAAAAGTGCGTGCGCGACGCGGACGCCCGCCAAAAGATGAAGTGCGCCTGGTTCCGCTG
ACAGATATAAGCTATGTCAGGCAGATGGAAAGCTGGATGATCACCACCCGGCCCCGTCGT
CGTGAACCATTATGGGCCGTGACCGACGAAACCATGCGCAACTGGCTGAAGCAGGCTGTC
AGACGGGCCGAAGCTGACGGGGTACACTTTTCGATTCCGGTAACACCACACACTTTCCGG
CACAGCTATATCATGCACATGCTCTATCACCGCCAGCCCCGGAAAGTCATCCAGGCACTG
GCTGGTCACAGGGATCCACGTTCGATGGAGGTCTATACACGGGTGTTTGCGCTGGATATG
GCTGCCACGCTGGCAGTGCCTTTCACAGGTGACGGACGGGATGCTGCAGAGATCCTGCGT
ACACTGCCTCCCCTGAAGTAA

And here what I wrote so far:

name = raw_input("file name: ") #Inputs the file's name
s = ""
ID = ""
with open(name,'r') as cDNAs:
    for line in cDNAs:
        if line.startswith('>'): #If it's a header:
            if(ID != ""): #If there is a sequence.
            print ID + "\n" + s + 2 * "\n"
                cDNA(ID,s)
            ID = line[1: line.find(' ')]
            s = "" #Initializes the sequence.
        else:
            s += line #The sequence itself.
            s = s.replace("\n", "") #Removes all the end of lines.

print ID + "\n" + s + 2 * "\n"
cDNA(ID,s)

Thanks!!!

python fasta • 1.8k views
ADD COMMENTlink modified 21 months ago by unksci150 • written 21 months ago by elisheva70
1

How big the fasta file is and what kind of computation you need?

ADD REPLYlink written 21 months ago by shenwei3564.5k
2
gravatar for unksci
21 months ago by
unksci150
unksci150 wrote:

When processing large files with many FASTA sequences one potential reason for hanging the computer is a lack of sufficient memory.

While such a scenario can be avoided in multiple ways, a particularly convenient option is to use BioPython and its parser, which allows you to process individual entries sequentially

# Install the biopython library (if not already installed)
!pip install biopython

# Import parts of Biopython
from Bio import SeqIO


# File path to your FASTA file
path_to_file = 'example.fasta'   # <--- substitute by your local path

# Open file with "with" statement to avoid problems with access 
# to original file (in case computer hangs
# or there will be any other problem)
with open(path_to_file, mode='r') as handle:

    # Use Biopython's parse function to process individual
    # FASTA records (thus reducing memory footprint)
    for record in SeqIO.parse(handle, 'fasta'):

        # Extract individual parts of the FASTA record
        identifier = record.id
        description = record.description
        sequence = record.seq

        # Example: adapt to extract features you are interested in
        print('----------------------------------------------------------')
        print('Processing the record {}:'.format(identifier))
        print('Its description is: \n{}'.format(description))
        amount_of_nucleotides = len(sequence)
        print('Its sequence contains {} nucleotides.'.format(amount_of_nucleotides))
ADD COMMENTlink modified 21 months ago • written 21 months ago by unksci150
1

Whereas you are right about processing one entry at a time to avoid excessive memory usage when reading very large FASTA files, my experience is that BioPython is quite slow at parsing FASTA files. I would thus not use it if speed is a major concern.

ADD REPLYlink written 21 months ago by Lars Juhl Jensen11k

And if the speed indeed important to me? At least I want the Algorithm to run on all genomes. Is there a better way for this? (even not on python...)

ADD REPLYlink written 21 months ago by elisheva70

The right way to go depends a lot on what you want to do. Reading the FASTA file is clearly not the end goal here - I assume you want to do something with the sequences. My suggestion would be that you post a new question, in which you explain what the actual problem you are trying to solve is. That would make it so much easier for us to give you a good answer.

ADD REPLYlink modified 21 months ago • written 21 months ago by Lars Juhl Jensen11k
1

O.K. I will do it. tnx.

ADD REPLYlink written 21 months ago by elisheva70
1

You dont need with open() before SeqIO parse I believe. SeqIO should handle the opening of the file if its passed a filepath as a variable. Certainly this is true of SeqIO.read and SeqIO.convert so I believe will be the case for parse too.

ADD REPLYlink modified 21 months ago • written 21 months ago by jrj.healey10k

You are right indeed.

ADD REPLYlink written 21 months ago by WouterDeCoster36k

While "with" is not strictly required for SeqIO to parse, it's usage is motivated by python's error handling and the unlocking /closing of the fasta file in the preanticipation of a possible and vaguely defined "hanging of the computer" (OP) (as that hanging could also be because of OP's code that processes individual records)

ADD REPLYlink modified 21 months ago • written 21 months ago by unksci150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 634 users visited in the last hour