Question

How Can I Get Coding Sequences For A List Of Genes?

4

Entering edit mode

13.6 years ago

David B ▴ 70

I am working with bacterial genomes and I have a long list of genes, from different organisms.

The list format is replicon accessions (e.g. NC12345) and locus tag (e.g. Abc1234). I have no other identifiers for the gene.

Most of the genes are coding genes, but not all. I would like to get the coding sequence for each gene that has one.

how can I do that (preferably in Perl)?

Best, Dave

perl sequence • 7.6k views

ADD COMMENT • link updated 13.6 years ago by biobot 0.0.77.a.1099 6.2k • written 13.6 years ago by David B ▴ 70

Ram · Answer 1 · 2010-09-15

10

Entering edit mode

13.6 years ago

Neilfws 49k

You have 2 problems to solve here: fetching the sequences and parsing them to extract CDS.

To deal with the second problem: install Bioperl, if you have not already done so. Then, take a look at the SeqIO how-to. If you installed the accessory scripts, there's a handy utility named bp_extract_feature_seq, which you can run like this:

bp_extract_feature_seq -i NC_005213.gb --format genbank --feature=CDS -o NC_005213.fa

It will write a fasta file containing the coding sequences of all CDS features.

You'll want to automate the process of fetching sequences by looping through the replicon accessions. Here's some sample code from the Bioperl tutorial which will fetch a sequence from RefSeq and write it in GenBank format:

#!/usr/bin/perl

use strict;
use Bio::Perl;

my $seq_object = get_sequence('refseq', "NC_005213");
write_sequence(">NC_005213.gb", 'genbank', $seq_object);

It should not be too hard to write a loop into that, using the NC_* accessions from your file.

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.6 years ago by Neilfws 49k

0

Entering edit mode

+1 Thanks! Just two more things:

Is there an equivalent perl function to bp_extract_feature_seq? I would like to this all from a perl script, and it would be nicer to call a function instead of using system("bp_extract_feature_seq",..)
How do I get the CDS for a specific locus tag (not the entire CDSs in the replicon)?

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.6 years ago by David B ▴ 70

0

Entering edit mode

I would take a look at the code in bp_extract_feature_seq - it's just a perl script. I only mentioned it for the convenience, but it should be easy to build something around SeqIO, if you study the how-to.

ADD REPLY • link 13.6 years ago by Neilfws 49k

Ram · Answer 2 · 2010-09-15

5

Entering edit mode

13.6 years ago

biobot 0.0.77.a.1099 6.2k

An addendum to Jorge Amigo's answer. Use BioMart as he describes. Additionally, set the gene type to be 'protein_coding'. Then click the 'Perl' button on the top right. Edit the script that it returns to include a loop over your set of organisms. e.g.

$query->setDataset("<my organism>");
$query->addFilter("refseq_peptide", ["<id 1>","<id 2>"]);
$query->addFilter("biotype", ["protein_coding"]);

You may have to normalise your IDs so that they are all of the same type (e.g. Refseq)

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.6 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

indeed Keith. thumbs up! editing the Perl code that BioMart definitely increases even more its query capabilities. looping through the set of organisms David has like you describe should be straight-forward.

ADD REPLY • link 13.6 years ago by Jorge Amigo 14k

score 4 · Answer 3 · 2010-09-15

4

Entering edit mode

13.6 years ago

Jorge Amigo 14k

you can use BioMart to retrieve such information. I would select the latest version of the Ensembl Genes database (currently #59) and then select the appropriate organism. you will then only have to enter the list of IDs you have, using the "ID list limit" filter and selecting from the combo box the type of IDs you are going to enter for that particular organism. you will have to do it organism by organism though.

ADD COMMENT • link 13.6 years ago by Jorge Amigo 14k

0

Entering edit mode

arrr... this will not work for me as I have dozens of organisms.

ADD REPLY • link 13.6 years ago by David B ▴ 70

0

Entering edit mode

I've voted Keith's answer, as I think that it'll surely solve your problems easily. good luck ;)

ADD REPLY • link 13.6 years ago by Jorge Amigo 14k

Ram · Answer 4 · 2010-09-15

2

Entering edit mode

13.6 years ago

Pierre Lindenbaum 161k

From the NCBI, you can use ESearch and/or EFetch to retrieve an XML description of your record (search biostar for some examples).

Here is an example.

Then, using XSLT/XPATH, you can extract the CDS (again, search biostar for some examples).

XPATH example:

/GBSet/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='translation']/GBQualifier_value

but I'm afraid the annotation of the NCBI are not homogeneous, so the word 'translation' might not be the only one to match the CDS.

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

+1 Thanks, it's great to know this option exists.

ADD REPLY • link 13.6 years ago by David B ▴ 70