Select records from fasta file by SeqIO (Biopython)
6.1 years ago

Hi all! Please help. I parsed sequences from GenBank, renamed them, and saved them as a FASTA file.

>KP821216.1_Bluetongue v_Cameroon_Jan-1982
ATGGCTGCTCAGAATGAGCAACGTCCGGAGCGAATAAAAACGACACCGTATTTAGAGGGA
GATGTGCTTTCGAGTGATTCAGGACCGCTGCTTTCCGTGTTCGCGCTGCAAGAAATAATG

The last 4 characters of each header are the year in which the virus was isolated. Now I need to select only the records that fall within a given range of years (for example 1958-1990):

from Bio import SeqIO

date_from = 1958
date_to = 1990
count = 0

# Keep only the records whose year (last 4 characters of the header) is in range.
with open("range_date_select.txt", "w") as output_file:
    for record in SeqIO.parse("Bluetong_batch_cds.txt", "fasta"):
        year = int(record.description[-4:])
        if date_from <= year <= date_to:
            SeqIO.write(record, output_file, "fasta")
            count += 1

print(count)

The task then becomes more complicated: I need no more than 4 records per year. If a year has more than 4 records, 4 of them should be chosen at random.

Can anybody help me with this? Thanks in advance.

sequence SeqIO Biopython fasta • 3.8k views
6.1 years ago

This will do the job:

import random
from Bio import SeqIO

d = {}
year_st = 1958
year_en = 1990 
for record in SeqIO.parse("Bluetong_batch_cds.txt", "fasta"):
    year = int(record.description[-4:])
    if year >= year_st and year <= year_en:
        if year not in d:
            d[year] = []
        d[year].append(record)


output_file = open("range_date_select.txt", "w")
for year in sorted(d.keys()):
    random.shuffle(d[year])
    for i, record in enumerate(d[year]):
        if i < 4:
            output_file.write(record.format("fasta"))
output_file.close()
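
A variant of the same idea, as a minimal sketch: group the records by year, then draw at most 4 per year with random.sample instead of shuffling and counting. To keep it runnable without Biopython, it works on plain header strings. The function pick_per_year, its parameters, and the sample headers are hypothetical names made up for this illustration, not part of Biopython.

```python
import random
from collections import defaultdict


def pick_per_year(records, year_of, year_from, year_to, max_per_year=4, seed=None):
    """Return at most max_per_year randomly chosen records for each year in range.

    year_of is a caller-supplied function that extracts the year from a record.
    """
    rng = random.Random(seed)  # a seed makes the random choice reproducible
    by_year = defaultdict(list)
    for rec in records:
        year = year_of(rec)
        if year_from <= year <= year_to:
            by_year[year].append(rec)

    chosen = []
    for year in sorted(by_year):
        group = by_year[year]
        # sample() needs k <= len(group), so cap it for years with few records
        chosen.extend(rng.sample(group, min(max_per_year, len(group))))
    return chosen


# Hypothetical headers; the year is the last 4 characters, as in the question.
headers = [
    "A_Jan-1982", "B_Feb-1982", "C_Mar-1982", "D_Apr-1982",
    "E_May-1982", "F_Jun-1982", "G_Jul-1975", "H_Aug-1995",
]
selected = pick_per_year(headers, lambda h: int(h[-4:]), 1958, 1990, seed=0)
print(len(selected))  # one record from 1975 plus four from 1982
```

With Biopython, the same function could presumably be given the records from SeqIO.parse("Bluetong_batch_cds.txt", "fasta") with year_of=lambda r: int(r.description[-4:]), and the returned list passed to SeqIO.write.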

Thanks a lot! It works very well!

But could you please explain why this works:

for year in sorted(d.keys()):
    random.shuffle(d[year])
    for i, record in enumerate(d[year]):
        if i < 4:
            output_file.write(record.format("fasta"))

If there are more than 4 records in the list "d[year]", shouldn't nothing be written, because the condition "if i < 4" is not met? Yet records are written. I'm a newbie in Python, so I know this is probably a very basic question.


Yes, you are right: records are still written when a year has more than 4 of them, because the condition is checked for each record separately, not for the list as a whole.

The first for loop iterates over the years. For each year, I shuffle the list d[year] so that the sequences for that year are in random order. At this point, d[year] still contains all sequences for that year (there may be more than 4). The second for loop iterates over each sequence record in d[year], and enumerate counts them as the iteration goes, from 0 up to the number of sequences in d[year] minus 1, so the variable i is just a counter. For the first sequence i is 0, for the second i is 1, and so on.

So the statement if i < 4 means that only the first four sequences in d[year] are saved to the output file. Nothing is done with the fifth (i = 4), the sixth (i = 5), or any later sequence in the list. If you are satisfied with my answer, please mark it as accepted.
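
To see the counter at work on its own, here is a small sketch with plain strings standing in for the sequence records (the names seq_a ... seq_f are made up for illustration). It also shows that shuffling and keeping i < 4 selects the same items as slicing the shuffled list with [:4].

```python
import random

# Hypothetical stand-ins for the records of one year (6 of them, so more than 4).
records = ["seq_a", "seq_b", "seq_c", "seq_d", "seq_e", "seq_f"]
random.shuffle(records)  # random order, but still all 6 items

written = []
for i, record in enumerate(records):
    # i is 0, 1, 2, 3, 4, 5; the test is made once per record
    if i < 4:
        written.append(record)

print(len(written))            # 4
print(written == records[:4])  # True
```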
