Question

Select records from fasta file by SeqIO (Biopython)

0

6.1 years ago

dmitri.ivanovsky • 0

Hi all! Please help. I parsed sequences from GenBank, renamed them, and saved them as a FASTA file.

>KP821216.1_Bluetongue v_Cameroon_Jan-1982
ATGGCTGCTCAGAATGAGCAACGTCCGGAGCGAATAAAAACGACACCGTATTTAGAGGGA
GATGTGCTTTCGAGTGATTCAGGACCGCTGCTTTCCGTGTTCGCGCTGCAAGAAATAATG

The last 4 characters are the year in which the virus was isolated. Now I need to select only the records that fall within a given range (for example 1958-1990):

from Bio import SeqIO

date_from = 1958
date_to = 1990
count = 0

# Keep only records whose description ends with a year inside the range.
with open("range_date_select.txt", "w") as output_file:
    for record in SeqIO.parse("Bluetong_batch_cds.txt", "fasta"):
        year = int(record.description[-4:])  # last 4 characters are the year
        if date_from <= year <= date_to:
            SeqIO.write(record, output_file, "fasta")
            count += 1

print(count)

The task then becomes more complicated: I need no more than 4 records per year. If a year has more than 4 records, 4 of them should be chosen at random.

Can anybody help me do this? Thanks in advance.

sequence SeqIO Biopython fasta • 3.8k views

ADD COMMENT • link updated 6.1 years ago by Andrzej Zielezinski 11k • written 6.1 years ago by dmitri.ivanovsky • 0

score 4 · Accepted Answer · 2018-03-24

4

6.1 years ago

Andrzej Zielezinski 11k

This will do the job:

import random
from Bio import SeqIO

d = {}  # year -> list of records isolated in that year
year_st = 1958
year_en = 1990
for record in SeqIO.parse("Bluetong_batch_cds.txt", "fasta"):
    year = int(record.description[-4:])  # last 4 characters are the year
    if year >= year_st and year <= year_en:
        if year not in d:
            d[year] = []
        d[year].append(record)


output_file = open("range_date_select.txt", "w")
for year in sorted(d.keys()):
    random.shuffle(d[year])  # random order within each year
    for i, record in enumerate(d[year]):
        if i < 4:  # write at most 4 records per year
            output_file.write(record.format("fasta"))
output_file.close()
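
As a possible variation on the same idea, the grouping and the random pick can be wrapped in a function that uses `random.sample` instead of shuffling. This is only a sketch: the function name `pick_per_year`, its parameters, and the `FakeRecord` stand-in are hypothetical and not part of the answer above. It assumes, as in the question, that every description ends with a 4-digit year.

```python
import random
from collections import defaultdict, namedtuple


def pick_per_year(records, year_from, year_to, max_per_year=4, rng=random):
    """Return up to max_per_year randomly chosen records per year, oldest year first."""
    by_year = defaultdict(list)
    for record in records:
        # Assumption: the description ends with a 4-digit year, as in the question.
        year = int(record.description[-4:])
        if year_from <= year <= year_to:
            by_year[year].append(record)

    selected = []
    for year in sorted(by_year):
        group = by_year[year]
        # random.sample raises ValueError if k exceeds len(group), so cap k.
        selected.extend(rng.sample(group, min(max_per_year, len(group))))
    return selected


# Hypothetical stand-in for SeqRecord objects so the sketch runs without Biopython.
FakeRecord = namedtuple("FakeRecord", "description")
demo = [FakeRecord(f"seq{i}_Jan-1982") for i in range(6)]
demo += [FakeRecord("seqA_Mar-1975"), FakeRecord("seqB_Feb-2001")]

# A seeded generator makes the random choice repeatable.
chosen = pick_per_year(demo, 1958, 1990, rng=random.Random(0))
print(len(chosen))
```

With Biopython, the same function should accept `SeqIO.parse("Bluetong_batch_cds.txt", "fasta")` as `records`, and the returned list can be written in one call with `SeqIO.write(chosen, "range_date_select.txt", "fasta")`.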

ADD COMMENT • link 6.1 years ago by Andrzej Zielezinski 11k

0

Thanks a lot! It works very well!

But could you please explain why this works:

for year in sorted(d.keys()):
    random.shuffle(d[year])
    for i, record in enumerate(d[year]):
        if i < 4:
            output_file.write(record.format("fasta"))

If there are more than 4 records in the list "d[year]", shouldn't nothing be written, because the condition "if i < 4" is not met? Yet the records are written. I'm a newbie in Python, so I know this is probably a very basic question.

ADD REPLY • link 6.1 years ago by dmitri.ivanovsky • 0

1

Yes, you are right. In the first for loop I iterate over each year. Then I shuffle the list d[year] so that the sequences for that year are in random order. At this point, d[year] contains all sequences for the given year (there may be more than 4). In the second for loop I iterate over each sequence record in d[year] and count them as the iteration goes, starting from 0 and ending at the number of sequences in d[year] minus one (so the variable i is just a counter). For the first sequence i is 0, for the second i is 1, and so on. So the if i < 4 statement means that only the first four sequences in d[year] are saved to the output file. Nothing is done with the fifth (i = 4), the sixth (i = 5), or any later sequence in the list. If you are satisfied with my answer, please mark it as accepted.
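
A tiny illustration of the counter, using placeholder strings instead of real sequence records (the names `records` and `kept` are made up for this sketch). It also shows that, for a list, the `enumerate` loop with `if i < 4` selects the same items as the slice `[:4]`:

```python
# Six placeholder "records", standing in for an already shuffled d[year].
records = ["r0", "r1", "r2", "r3", "r4", "r5"]

kept = []
for i, record in enumerate(records):
    # i is checked separately for every record, so the first four pass
    # the test and the remaining two are skipped.
    if i < 4:
        kept.append(record)

print(kept)                  # ['r0', 'r1', 'r2', 'r3']
print(kept == records[:4])   # True
```

The condition is evaluated once per record, not once for the whole list, which is why a list longer than four still yields four written records.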

ADD REPLY • link 6.1 years ago by Andrzej Zielezinski 11k