Question

Parsing A Fasta File By Columns

2

Entering edit mode

11.9 years ago

RossCampbell ▴ 140

I'm trying to generate a position-weight matrix (PWM) from a Fasta file of aligned sequences. In order to do this, I need to count the occurence of a specific base at each position in the alignment. For example, given the following alignment:

ATGG
TTGG
ATTG
CTTG

I would need to somehow observe the fact that, in the first column, there are 2 A's, 1 T, and 1 C; and repeat this for every position (column) in the file. At the moment, I'm inclined to read each line, base by base, into a multidimensional array, and parse it by position. However, that seems like it could get computationally demanding if we get large sequences. Has anyone come across this before and come up with a more streamlined approach? I'm coding it in Perl, but if you have something useful in another language, I'll take it.

perl • 4.3k views

ADD COMMENT • link updated 11.9 years ago by Weronika ▴ 300 • written 11.9 years ago by RossCampbell ▴ 140

score 8 · Answer 1 · 2012-06-14

Give python a try:

from Bio import SeqIO
# get parser
parser   = SeqIO.parse( open("fasta_file_name"),'fasta' )
# get dictionary of sequences
seqDict = SeqIO.to_dict( parser )
# get seq into list
seqs = [ str(s.seq) for s in seqDict.itervalues() ]
#assume sequence are the same length
for i in range( len(seqs[0]) ):
 nucleotides = [ seq[i] for seq in seqs ]
 #do whatever you want

score 1 · Answer 2 · 2012-06-16

Or if you wanted to avoid having all the sequences in memory at once (if your file is very large) - another python option that processes the file sequence by sequence and adds up the base frequencies:

from Bio import SeqIO

# first just get the first sequence, and set up a list of base:count dictionaries based on its length
with open("fasta_file_name") as FILE:
    parser = SeqIO.parse(FILE, 'fasta' )
    seq1 = parser.next()
    base_counts = [{'A': 0, 'C': 0, 'T': 0, 'G': 0} for _ in str(seq1.seq)]

# now reopen the file and iterate over all sequences, incrementing the base counts as needed
with open("fasta_file_name") as FILE:
    for seq in SeqIO.parse(FILE, 'fasta' ):
        for position,base in enumerate(str(seq.seq.upper())):
           base_counts[position][base] += 1

In the end you'll have a list of base-count dictionaries per position, like this:

>>> base_dict[0]
{'A': 1, 'C': 4, 'T': 2, 'G': 3}
>>> base_dict[1]
{'A': 1, 'C': 0, 'T': 0, 'G': 9}