Download multiple bacterial CDS from NCBI using Biopython
4.5 years ago
lisssanka • 0

Hi there, I'm new to Biopython and I've stalled with the following task.

I need to get a number of coding sequences from different bacterial genomes. The idea is to have a fairly large dataset of different coding sequences: at least 7000 of them.

I don't care about the exact species, but they should not be closely related to each other. So I think I need to take a number of bacterial genomes (one species per genus, each from a different genus) and somehow retrieve all of their CDSs. I may then also need to filter out homologs.

I tried using Biopython's Entrez.esearch() and Entrez.efetch(), but I can't work out a good way to retrieve data that meets the conditions above. I'd appreciate any help and suggestions!

CDS Biopython NCBI • 995 views

You could perhaps try downloading representative sequences, and, depending on how many there are, maybe take a random selection of accessions as a surrogate.
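As a rough sketch of the "random selection of accessions" idea: the helper names below (`sample_accessions`, `fetch_cds_fasta`) are made up for illustration, and I'm assuming you already have a list of nucleotide accessions for representative genomes from somewhere. The fetch step uses `Entrez.efetch` with `rettype="fasta_cds_na"`, which returns the annotated CDS features of a nuccore record as nucleotide FASTA; it needs network access and a valid e-mail address, so treat it as untested against your particular records.

```python
import random


def sample_accessions(accessions, k, seed=0):
    """Draw k distinct accessions at random (reproducible for a given seed)."""
    # De-duplicate and sort so the result does not depend on input order.
    unique = sorted(set(accessions))
    if k > len(unique):
        raise ValueError(f"asked for {k} accessions but only {len(unique)} are available")
    return random.Random(seed).sample(unique, k)


def fetch_cds_fasta(accession, email):
    """Fetch all annotated CDSs of one nuccore record as FASTA text (needs network)."""
    # Imported here so the sampling helper above works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks every client to identify itself
    handle = Entrez.efetch(
        db="nuccore", id=accession, rettype="fasta_cds_na", retmode="text"
    )
    try:
        return handle.read()
    finally:
        handle.close()
```

You would then call `sample_accessions(my_accessions, 5)` and loop over the result with `fetch_cds_fasta(acc, "you@example.org")`, writing each FASTA string to its own file. Add a short pause between requests so you stay within NCBI's rate limits.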

7000 coding sequences isn't very many for bacterial genomes. A fairly typical enterobacterium such as E. coli has between 4000 and 5000 on its own.

To my knowledge, there's no very easy way to get a single representative from each genus. This is a question that has come up for another excellent tool, ncbi-genome-download, and there isn't a simple answer other than taking a genome at random, AFAICT.
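If you go the "take one at random" route, something like the following would do it. This is only a sketch: I'm assuming each genome is a dict with `organism_name` and `assembly_accession` keys (for example, rows parsed from an assembly summary table), and that the genus is simply the first word of the organism name. That heuristic will mis-handle names such as "Candidatus ..." or "uncultured ...", so filter those first if they matter to you.

```python
import random
from collections import defaultdict


def one_genome_per_genus(rows, seed=0):
    """Pick one assembly accession at random for every genus found in rows."""
    rng = random.Random(seed)
    by_genus = defaultdict(list)
    for row in rows:
        name = row["organism_name"].strip()
        if not name:
            continue  # skip records without an organism name
        genus = name.split()[0]  # assumption: first word of the name is the genus
        by_genus[genus].append(row["assembly_accession"])
    # Sort genera and accessions so the outcome is reproducible for a given seed.
    return {
        genus: rng.choice(sorted(accessions))
        for genus, accessions in sorted(by_genus.items())
    }
```

The returned dict maps genus to one accession, which you can feed into whatever download step you prefer.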

A 'rigorous' approach might be to download a large selection of genomes, and calculate a whole load of pairwise mash distances to figure out the most distantly related ones, then select those to 'harvest' CDSs from. This would probably take a good while and quite a bit of disk space/memory though.
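The selection step of that approach could look roughly like this. It assumes you have already run `mash dist` and have its tab-separated output (reference, query, distance, p-value, shared hashes per line). The greedy "farthest point" rule here is just one simple heuristic I'm suggesting, not the only way to do it, and the function names are made up.

```python
def parse_mash_dist(lines):
    """Read `mash dist` output lines into a symmetric {(a, b): distance} dict."""
    dist = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        a, b, d = fields[0], fields[1], float(fields[2])
        dist[(a, b)] = d
        dist[(b, a)] = d
    return dist


def pick_distant(genomes, dist, k):
    """Greedily choose up to k genomes that are far apart from each other."""
    genomes = sorted(set(genomes))
    if not genomes or k <= 0:
        return []

    def d(a, b):
        # Assumption: pairs missing from the table count as maximally distant (1.0).
        return 0.0 if a == b else dist.get((a, b), 1.0)

    # Start from the genome with the largest total distance to all the others.
    chosen = [max(genomes, key=lambda g: sum(d(g, other) for other in genomes))]
    while len(chosen) < min(k, len(genomes)):
        remaining = [g for g in genomes if g not in chosen]
        # Add the genome whose nearest already-chosen neighbour is farthest away.
        chosen.append(max(remaining, key=lambda g: min(d(g, c) for c in chosen)))
    return chosen
```

Each round scans every remaining genome against the chosen set, so this is fine for hundreds of genomes but will get slow for many thousands. In that case, subsample first.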
