Question: How Can I Get Coding Sequences For A List Of Genes?
4
gravatar for David B
8.5 years ago by
David B70
David B70 wrote:

I am working with bacterial genomes and I have a long list of genes, from different organisms.

The list format is replicon accessions (e.g. NC12345) and locus tag (e.g. Abc1234). I have no other identifiers for the gene.

Most of the genes are coding genes, but not all. I would like to get the coding sequence for each gene that has one.

how can I do that (preferably in Perl)?

Best, Dave

perl sequence • 5.3k views
ADD COMMENTlink written 8.5 years ago by David B70
10
gravatar for Neilfws
8.5 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

You have 2 problems to solve here: fetching the sequences and parsing them to extract CDS.

To deal with the second problem: install Bioperl, if you have not already done so. Then, take a look at the SeqIO how-to. If you installed the accessory scripts, there's a handy utility named bp_extract_feature_seq, which you can run like this:

bp_extract_feature_seq -i NC_005213.gb --format genbank --feature=CDS -o NC_005213.fa

It will write a fasta file containing the coding sequences of all CDS features.

You'll want to automate the process of fetching sequences by looping through the replicon accessions. Here's some sample code from the Bioperl tutorial which will fetch a sequence from RefSeq and write it in GenBank format:

#!/usr/bin/perl

use strict;
use Bio::Perl;

my $seq_object = get_sequence('refseq', "NC_005213");
write_sequence(">NC_005213.gb", 'genbank', $seq_object);

It should not be too hard to write a loop into that, using the NC_* accessions from your file.

ADD COMMENTlink written 8.5 years ago by Neilfws48k

+1 Thanks! Just one more thing. Is there an equivalent perl function to bp_extract_feature_seq? I would like to this all from a perl script, and it would be nicer to call a function insteast of using system("bp_extract_feature_seq",..)

ADD REPLYlink written 8.5 years ago by David B70

+1 Thanks! Just one more thing. Is there an equivalent perl function to bp_extract_feature_seq? I would like to this all from a perl script, and it would be nicer to call a function instead of using system("bp_extract_feature_seq",..)

ADD REPLYlink written 8.5 years ago by David B70

+1 Thanks! Just two more things: 1)Is there an equivalent perl function to bp_extract_feature_seq? I would like to this all from a perl script, and it would be nicer to call a function instead of using system("bp_extract_feature_seq",..) 2) How do I get the CDS for a specific locus tag (not the entire CDSs in the replicon)?

ADD REPLYlink written 8.5 years ago by David B70

I would take a look at the code in bp_extract_feature_seq - it's just a perl script. I only mentioned it for the convenience, but it should be easy to build something around SeqIO, if you study the how-to.

ADD REPLYlink written 8.5 years ago by Neilfws48k
5
gravatar for iw9oel_ad
8.5 years ago by
iw9oel_ad6.0k
iw9oel_ad6.0k wrote:

An addendum to Jorge Amigo's answer. Use BioMart as he describes. Additionally, set the gene type to be 'protein_coding'. Then click the 'Perl' button on the top right. Edit the script that it returns to include a loop over your set of organisms. e.g.

$query->setDataset("<my organism>");
$query->addFilter("refseq_peptide", ["<id 1>","<id 2>"]);
$query->addFilter("biotype", ["protein_coding"]);

You may have to normalise your IDs so that they are all of the same type (e.g. Refseq)

ADD COMMENTlink written 8.5 years ago by iw9oel_ad6.0k

indeed Keith. thumbs up! editing the Perl code that BioMart definitely increases even more its query capabilities. looping through the set of organisms David has like you describe should be straight-forward.

ADD REPLYlink written 8.5 years ago by Jorge Amigo11k
4
gravatar for Jorge Amigo
8.5 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

you can use BioMart to retrieve such information. I would select the latest version of the Ensembl Genes database (currently #59) and then select the appropriate organism. you will then only have to enter the list of IDs you have, using the "ID list limit" filter and selecting from the combo box the type of IDs you are going to enter for that particular organism. you will have to do it organism by organism though.

ADD COMMENTlink written 8.5 years ago by Jorge Amigo11k

arrr... this will not work for me as I have dozens of organisms.

ADD REPLYlink written 8.5 years ago by David B70

I've voted Keith's answer, as I think that it'll surely solve your problems easily. good luck ;)

ADD REPLYlink written 8.5 years ago by Jorge Amigo11k
2
gravatar for Pierre Lindenbaum
8.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

From the NCBI, you can use ESearch (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html ) and/or EFetch ( http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html ) to retrieve an XML description of your record (search biostar for some examples).

For example: http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=5&retmode=xml

Then, using XSLT/ XPATH, you can extract the CDS (again, search biostar for some examples).

XPATH example:

/GBSet/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='translation']/GBQualifier_value

but I'm afraid the annotation of the NCBI are not homogeneous, so the word 'translation' might not be the only one to match the CDS.

ADD COMMENTlink written 8.5 years ago by Pierre Lindenbaum118k

+1 Thanks, it's great to know this option exists.

ADD REPLYlink written 8.5 years ago by David B70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1841 users visited in the last hour