Obtaining sequence from Bioproject IDs using biopython gives unknown sequence
8.9 years ago
Prasad ▴ 50

Hi All,

I have a list of BioProject IDs and would like to get the corresponding sequences. I am following the steps below:

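1. Using the BioProject ID, I get the GI ID with elink:

from Bio import Entrez, SeqIO

handle = Entrez.elink(dbfrom="bioproject", db="nuccore", id=bioprojecID, linkname="bioproject_nuccore_wgsmaster")
record = Entrez.read(handle)
# "LinkSetDb" and "Link" are both lists, so index into them; this takes the first linked ID
GI_ID = record[0]["LinkSetDb"][0]["Link"][0]["Id"]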

2. Then I try to fetch the sequence for GI_ID (using efetch and the SeqIO module in Biopython):

handle = Entrez.efetch(db="nucleotide", id=GI_ID, rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")

But printing the record shows an unknown sequence.
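As a side note on step 1: an elink result can hold more than one linked ID, so it may help to collect them all before fetching. The sketch below assumes the nested layout that Entrez.read() produces for elink (a list of link sets, each with a "LinkSetDb" list whose entries hold a "Link" list). The helper name linked_ids and the example IDs are made up for illustration and are not part of Biopython.

```python
def linked_ids(elink_result):
    """Collect every linked ID from a parsed elink result.

    Assumes the nested layout produced by Entrez.read() for elink:
    a list of link sets, each with a "LinkSetDb" list whose entries
    hold a "Link" list of {"Id": ...} items.
    """
    ids = []
    for linkset in elink_result:
        for linkset_db in linkset.get("LinkSetDb", []):
            for link in linkset_db.get("Link", []):
                ids.append(str(link["Id"]))
    return ids


# Hypothetical stand-in for Entrez.read(handle); the IDs are made up.
example = [{"LinkSetDb": [{"LinkName": "bioproject_nuccore_wgsmaster",
                           "Link": [{"Id": "111"}, {"Id": "222"}]}]}]
print(linked_ids(example))
```

An empty "LinkSetDb" list simply yields an empty result, which is worth checking for before calling efetch.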

Can anyone advise whether this is the right way to do it, or whether there is a better way to obtain the related sequences from BioProject IDs?

Thanks in advance!

efetch elink biopython eutilities • 4.1k views
8.9 years ago

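I can help with the SeqIO part, assuming that your "handle" gives GenBank-formatted data (either the handle returned by efetch or the name of a GenBank file on disk; SeqIO.parse accepts both).

from Bio import SeqIO

# Iterate over every record in the GenBank data and print its ID and sequence
for record in SeqIO.parse(handle, "genbank"):
    print(record.id, record.seq)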

For more options, put this inside the loop instead:

    print(dir(record))
    break

This prints the names of the attributes and methods available on the record object, so you can see what other information you can get from your file (handle).
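The dir() output also includes many underscore-prefixed internals, so filtering them out can make it easier to read. This is only a sketch: public_attributes is a hypothetical helper, and FakeRecord is a made-up stand-in used so the example runs without Biopython or a real file.

```python
def public_attributes(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class FakeRecord:
    """Hypothetical stand-in for a parsed record, used only for illustration."""

    def __init__(self):
        self.id = "EXAMPLE_1"
        self.seq = "ACGT"
        self.description = "made-up record"


print(public_attributes(FakeRecord()))
```

With a real record you would call public_attributes(record) inside the loop in the same way.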


Hi, thanks for replying. I tried printing record.seq but it gives weird output (multiple 'N' characters).


It is very common to have multiple 'N' characters at the start of a sequence. Each chromosome may begin with a run of Ns (hundreds or thousands of bases long). Scroll further down into your sequence.
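Rather than scrolling, a quick count can show whether the Ns are only a leading run or make up the whole sequence. The sketch below is plain Python working on a sequence string; n_summary is a hypothetical helper name, and the example strings are made up.

```python
def n_summary(sequence):
    """Return (length of the leading run of Ns, fraction of Ns overall)."""
    seq = str(sequence).upper()
    leading = len(seq) - len(seq.lstrip("N"))
    fraction = seq.count("N") / len(seq) if seq else 0.0
    return leading, fraction


# Made-up examples: Ns at the start only, versus Ns throughout
print(n_summary("NNNNACGTNACGT"))
print(n_summary("NNNNNNNN"))
```

If the fraction comes back as 1.0, the record may carry no real sequence data at all, which would be worth ruling out before looking further. With a real record you would pass str(record.seq); I believe newer Biopython releases raise an error for a sequence with undefined content instead of returning Ns, so treat that call as an assumption to verify.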
