Select records from fasta file by SeqIO (Biopython)
3.6 years ago

Hi all! Please help. I parsed sequences from GenBank, renamed them, and saved them as a FASTA file.

>KP821216.1_Bluetongue v_Cameroon_Jan-1982
ATGGCTGCTCAGAATGAGCAACGTCCGGAGCGAATAAAAACGACACCGTATTTAGAGGGA
GATGTGCTTTCGAGTGATTCAGGACCGCTGCTTTCCGTGTTCGCGCTGCAAGAAATAATG


The last 4 characters are the year when the virus was isolated. Now I need to select only the records that fall within a given range (for example 1958-1990):

from Bio import SeqIO

output_file = open("range_date_select.txt", "w")

date_from = 1958
date_to = 1990
count = 0
for i, record in enumerate(SeqIO.parse("Bluetong_batch_cds.txt", "fasta")):
    a = record.description[-4:]
    if date_from <= int(a) <= date_to:
        SeqIO.write(record, output_file, "fasta")
        count = count + 1

print(count)
output_file.close()
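
One thing to watch with `int(record.description[-4:])`: it raises a ValueError as soon as one header does not end in four digits. A minimal sketch of a more forgiving year lookup, assuming the year is always the trailing four digits when present. The helper name `year_from_header` and the second header are made up for illustration, and plain strings stand in for `record.description`:

```python
import re

# Matches four digits at the very end of the header (trailing spaces allowed).
_YEAR_AT_END = re.compile(r"(\d{4})\s*$")


def year_from_header(header):
    """Hypothetical helper: return the trailing year as int, or None if absent."""
    match = _YEAR_AT_END.search(header)
    return int(match.group(1)) if match else None


headers = [
    "KP821216.1_Bluetongue v_Cameroon_Jan-1982",  # header from the post
    "XX000000.1_example_without_year",            # made-up header with no year
]

date_from = 1958
date_to = 1990

selected = []
for header in headers:
    year = year_from_header(header)
    # Headers without a year are skipped instead of crashing the loop.
    if year is not None and date_from <= year <= date_to:
        selected.append(header)
```

In the loop above, the same check could replace the `int(a)` call, with `record.description` passed in as the header.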


The task then becomes more complicated: I need no more than 4 records per year. If a year has more than 4, then 4 records should be chosen at random.

Can anybody help me with this? Thanks in advance.

sequence SeqIO Biopython fasta
3.6 years ago

This will do the job:

import random
from Bio import SeqIO

d = {}
year_st = 1958
year_en = 1990
for record in SeqIO.parse("Bluetong_batch_cds.txt", "fasta"):
    year = int(record.description[-4:])
    if year >= year_st and year <= year_en:
        if year not in d:
            d[year] = []
        d[year].append(record)

output_file = open("range_date_select.txt", "w")
for year in sorted(d.keys()):
    random.shuffle(d[year])
    for i, record in enumerate(d[year]):
        if i < 4:
            output_file.write(record.format("fasta"))
output_file.close()
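
The same grouping and selection can also be written with `collections.defaultdict` and `random.sample`. This is only a sketch of an alternative, not a correction: the function name `pick_per_year` and the sample data are made up, and plain `(year, name)` tuples stand in for the parsed records so the snippet runs without Biopython.

```python
import random
from collections import defaultdict


def pick_per_year(items, year_st, year_en, max_per_year=4):
    """Hypothetical helper: items is an iterable of (year, record) pairs.

    Returns a dict mapping each year in range to at most max_per_year
    randomly chosen records.
    """
    by_year = defaultdict(list)
    for year, record in items:
        if year_st <= year <= year_en:
            by_year[year].append(record)

    chosen = {}
    for year in sorted(by_year):
        records = by_year[year]
        # random.sample raises ValueError if k exceeds the list length,
        # so cap k for years with fewer than max_per_year records.
        k = min(max_per_year, len(records))
        chosen[year] = random.sample(records, k)
    return chosen


# Made-up data: six records for 1982, one for 1975, one outside the range.
items = [(1982, f"seq{i}") for i in range(6)] + [(1975, "seqA"), (1995, "seqB")]
chosen = pick_per_year(items, 1958, 1990)
```

With real data, the pairs could be built as `(int(record.description[-4:]), record)` while parsing, and each chosen record written with `SeqIO.write`.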


Thanks a lot! It works very well!

But could you please explain why this works:

for year in sorted(d.keys()):
    random.shuffle(d[year])
    for i, record in enumerate(d[year]):
        if i < 4:
            output_file.write(record.format("fasta"))


If the list d[year] contains more than 4 records, shouldn't they be left unwritten because the condition "if i < 4" is not met? Yet they are written. I'm a newbie in Python, so I know this is probably a very basic question.


Yes, you are right. In the first for loop I iterate over each year. Then I shuffle the list d[year] so that the sequences for that year are in random order. At this point, d[year] contains all sequences for a given year (there may be more than 4). In the second for loop I iterate over each sequence record in d[year] and count them as the iteration goes, from 0 up to one less than the number of sequences in the list (so the variable i is just a counter). For the first sequence i is 0, for the second i is 1, and so on. The if i < 4 statement therefore means that only the first four sequences in d[year] are saved to the output file. Nothing is done with the fifth (i = 4), sixth (i = 5), or any later sequence in the list. If you are satisfied with my answer, please mark it as accepted.
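
To see the counter at work without any sequence files, here is a small sketch on a plain list. The record names are made up, and the two lists `written` and `skipped` exist only to show which items pass the `i < 4` test:

```python
import random

# Made-up stand-ins for the records of a single year.
records = ["rec_a", "rec_b", "rec_c", "rec_d", "rec_e", "rec_f"]
random.shuffle(records)  # random order, as in the answer

written = []
skipped = []
for i, record in enumerate(records):
    if i < 4:
        written.append(record)   # i is 0, 1, 2 or 3: kept
    else:
        skipped.append(record)   # i is 4 or more: nothing is written

# The condition is checked per record, so it keeps the first four items
# of the shuffled list; this is the same as taking the slice records[:4].
assert written == records[:4]
```

Because the list was shuffled first, "the first four" is a random selection of four records for that year.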