Question: Add Sequences To A List From A Complex Fasta File In Python
7.1 years ago by
hicsuntdrac0nis220 wrote:

I'm trying to organize a FASTA file with multiple sequences. To do so, I'm trying to add the names to one list and the sequences to a separate list that is parallel to the name list. I figured out how to add the names to a list, but I can't figure out how to add the sequences that follow each name as separate strings. I tried appending the lines of sequence to an empty string, but that appended all the lines of all the sequences into a single string.

def Name_Organizer(FASTA,output):

    import os
    import re

    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')

    data=''
    name_list=[]

    for line in in_file:

        line=line.strip()
        for i in line:
            if i=='>':
                name_list.append(line)
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):
            data=data+line

    print data

How do I add the sequences to a list as a set of strings?

The input file looks like this, with a > before the name on each header line:

>44664.3|G1E3M3IX1IW|Greengenes|2471 16S ribosomal RNA [Microbacterium oxydans]
gactATAATTTGTAAATTTCTTGAGATAGAATCATTCGTATTGAATGAGGTCAAATTCTC
TAAACTGATTAAGAAGTATAATACTTAGATGCGAGTTATTGCATCACTTAACGGAGAGTT
TGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGTGAAG
TCTGAATTGAGTACTTCGGTATGATATTTGGGTGGAAAGTGGCGGACGGGTGAGTAACAC
GTGGGTAACCTGCCTCGAAGTGGGGACAACCATTGGAAACGATGGCTAATACCGCATAGT
TCTTTAGATGCATGAGCATTTATAGATAAAACTCTGGTGCTTCGAGAGGGGTCTGCGTCC
GATTAGTTAGTTGGTGGGTAAAGGCCTACCAAGACGATGATCGGTAGCTGGTCTGAGAGG
ACGATCAGTCACACGGGAACTGAGACACGGTCCagtcgtgggagacaaggcacacagggg
ataggnnnnn


>44684.3|G1E3M3B01IW|Greengenes|2688 16S ribosomal RNA [Microbacterium oxydans]
gactATAATTTGTAAATTTCTTGAGATAGAATCATTCGTATTGAATGAGGTCAAATTCTC
TAAACTGATTAAGAAGTATAATACTTAGATGCGAGTTATTGCATCACTTAACGGAGAGTT
TGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGTGAAG
TCTGAATTGAGTACTTCGGTATGATATTTGGGTGGAAAGTGGCGGACGGGTGAGTAACAC
GTGGGTAACCTGCCTCGAAGTGGGGACAACCATTGGAAACGATGGCTAATACCGCATAGT
TCTTTAGATGCATGAGCATTTATAGATAAAACTCTGGTGCTTCGAGAGGGGTCTGCGTCC
GATTAGTTAGTTGGTGGGTAAAGGCCTACCAAGACGATGATCGGTAGCTGGTCTGAGAGG
ACGATCAGTCACACGGGAACTGAGACACGGTCCagtcgtgggagacaaggcacacagggg
ataggnnnnn
fasta python list sequence • 5.5k views
7.1 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

So basically you just want to parse a FASTA file and put the contents in a header list and a sequence list? You can use Biopython's SeqIO module:

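from Bio import SeqIO
import sys

headerList = []
seqList = []

inFile = open(sys.argv[1],'r')
for record in SeqIO.parse(inFile,'fasta'):
   # record.id is the header up to the first whitespace;
   # use record.description if you need the full header line
   headerList.append(record.id)
   seqList.append(str(record.seq))
inFile.close()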

If you don't want to use Biopython, you can parse the file by hand:

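import sys

inFile = open(sys.argv[1],'r')

headerList = []
seqList = []
currentSeq = ''
for line in inFile:
   if line[0] == ">":
      # header line: store the name and flush the previous record's sequence
      headerList.append(line[1:].strip())
      if currentSeq != '':
         seqList.append(currentSeq)

      currentSeq = ''
   else:
      # sequence line: accumulate it without the trailing newline
      currentSeq += line.strip()

# the last sequence has no header after it, so append it after the loop
seqList.append(currentSeq)
inFile.close()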
7.1 years ago by
ALchEmiXt1.9k
The Netherlands
ALchEmiXt1.9k wrote:

When I wrote a FASTA reshuffle routine some time ago (to lay out contigs as mapped by BLAT), I used Perl and (if memory permits) just put all the sequences themselves in a hash, with the key being the unique FASTA header/identifier.

To reshuffle and display them in a particular order, I just had to step through an ordered list of FASTA headers and fetch the corresponding sequences from the hash.

In Python I think you can do the same relatively easily using the Python alternative to hashes: dictionaries (I believe; I'm not a Python expert). Together with the clues from Robert you should get there, even though I usually prefer regular expressions, but that's just preference.
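
The dictionary idea could look roughly like the sketch below. It is only an illustration, not code anyone in this thread posted: the function name `read_fasta_dict` and the two-record sample input are made up, and it assumes every header is unique (a repeated header would overwrite the earlier sequence).

```python
import io


def read_fasta_dict(handle):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    sequences = {}
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # a new record starts: store the one collected so far
            if header is not None:
                sequences[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # the last record has no header after it, so store it here
    if header is not None:
        sequences[header] = "".join(chunks)
    return sequences


# Hypothetical input held in memory instead of a file on disk.
example = io.StringIO(">seq1 first\nACGT\nacgt\n\n>seq2 second\nGGCC\n")
seqs = read_fasta_dict(example)

# Fetch the sequences in any chosen order from a list of headers.
wanted_order = ["seq2 second", "seq1 first"]
ordered = [seqs[name] for name in wanted_order]
```

With a real file, the same function should work on an open file object, e.g. `with open(path) as handle: seqs = read_fasta_dict(handle)`.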

7.1 years ago by
hicsuntdrac0nis220 wrote:

I needed to reset the string:

def Name_Organizer(FASTA,output):

    import os

    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')

    data=''
    name_list=[]
    seq_list=[]

    for line in in_file:

        line=line.strip()
        if line.startswith('>'):
            # new record: store the name and flush the previous sequence
            name_list.append(line)
            if data:
                seq_list.append(data)
                data=''
        else:
            data=data+line.upper()

    # the last sequence has no header after it, so append it here
    if data:
        seq_list.append(data)

    in_file.close()
    out_file.close()

    print(seq_list)
7.1 years ago by
Robert Ernst60
Rotterdam, The Netherlands.
Robert Ernst60 wrote:

You are going in the right direction! You should add a piece of code that puts your data string into a "sequence_list" when the line starts with a ">". Furthermore, to make sure that no empty data strings are added to your sequence_list, you should put this in an if statement (if data != ''). After doing this, you want to reset your data string (data = '').

Also, your second for loop (for i in line) is not required. You can just check for the ">" with line[0] == '>'.
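
Put together, those suggestions might look like the sketch below. The function name `parse_fasta` and the sample input are hypothetical. It uses `line.startswith('>')` instead of `line[0]` because the lines are stripped first, so a blank line would make `line[0]` raise an IndexError. It also appends the final sequence after the loop, since no header follows it, and it assumes every header is followed by at least one sequence line so the two lists stay parallel.

```python
import io


def parse_fasta(handle):
    """Return parallel lists of header lines and upper-cased sequences."""
    names = []
    seqs = []
    data = ""
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            names.append(line)
            # flush the previous record, but never store an empty string
            if data != "":
                seqs.append(data)
            data = ""  # reset for the record that starts here
        else:
            data += line.upper()
    # the last sequence is still in `data` when the loop ends
    if data != "":
        seqs.append(data)
    return names, seqs


# Hypothetical input with mixed case and blank lines, like the file in the question.
sample = io.StringIO(">a\nacgt\nNN\n\n\n>b\nTTaa\n")
names, seqs = parse_fasta(sample)
print(names)
print(seqs)
```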

I hope that helps you to solve the problem yourself.
