How to retrieve gene sequences and their products from NCBI's FTP?
3
0
Entering edit mode
9.6 years ago
fr ▴ 210

Hello everyone!

I'm analysing proteomes and genomes of many species using NCBI as my main database. Currently, I'd like to retrieve the gene sequences that code for each protein. Take, for instance, the case of Acidobacterium capsulatum ATCC 51196. You can find information about this organism in here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000022565.1_ASM2256v1/

The file GCA_000022565.1_ASM2256v1_protein.faa.gz has the protein sequences in this genome, and the file GCA_000022565.1_ASM2256v1_genomic.fna.gz has the full genome of the organism. However, I'd like to find a FASTA file with the genes that code for each of the proteins. Something like:

>ACO31298.1 cellulose synthase operon protein YhjQ [Acidobacterium capsulatum ATCC 51196]
GTCCCATGGAATCAATAGCTTACAGGACGTTAATGTGTGTTTTTTTAGCGCGCTCAGACCTCTCCGCAAGCACTCACTCC
AGGCCTGAAATGCCCTGCTACTCAAGCACCGCCCCGCTACTCCAGCACTAAAGTCACATACAGTTGCCGGGGCCCATGCA
CGCCGCGAATGCGGGTCATCTCAATATCGGCCGTGGCTGACGGACCGGAAAT...

(made up example, but you get the idea!)

Is there such a file? Where? I understand that this is a very basic question, but I've been searching the FTP and can't find it anywhere... I also searched in this website for a question like this but couldn't find anything either... I apologize if this question has been asked before!

Thanks for your help!

genome gene • 8.0k views
ADD COMMENT
4
Entering edit mode
9.6 years ago
rwn ▴ 590

Script for extracting CDSs from full genbank files. Run it without any arguments to get a little usage message.

#!/usr/bin/perl -w

use strict;

use Bio::SeqIO;
use Text::Wrap;

if (@ARGV == 0) { die "
\tExtracts all the CDS nucleotide sequences from a GenBank file (*.gbk). 
\tNB must be a 'full' gbk file, with the source sequence at the end.\n
\tUSAGE: genbank2fasta.pl <myGenbankFile> > myFastaFile.fasta\n
"}

## wraps sequence string every 80 characters:
$Text::Wrap::columns = 80;
$Text::Wrap::separator = "\n";

my $in = Bio::SeqIO -> new( -file => $ARGV[0], -format => 'genbank');

my $i;

while (my $seq = $in->next_seq) {
    for my $feat ($seq -> get_SeqFeatures) {
        if ($feat->primary_tag eq "CDS") {
            print "\>";
            ## construct fasta header based on information present for each CDS:
            if ( $feat->has_tag('locus_tag') ) {
                print $feat->get_tag_values('locus_tag'),"\|";    
            } 
            if ( $feat->has_tag('db_xref') ) {
                print $feat->get_tag_values('db_xref') ,"\|";    
            } 
            if ( $feat->has_tag('protein_id') ) {
                print $feat->get_tag_values('protein_id'),"\|";    
            } 
            if ( $feat->has_tag('gene') ) {
                print $feat->get_tag_values('gene'),"\|";            
            } 
            if ( $feat->has_tag('product') ) {
                print $feat->get_tag_values('product');
            }
            ## print sequence:
            print "\n",wrap("","",$feat->seq->seq),"\n";
            $i++;
        }
    }
}

print STDERR "\n\tExtracted $i sequences from file $ARGV[0]\n\n";

__END__
ADD COMMENT
0
Entering edit mode

This looks like a neat script. Do you mind if I host this on my GitHub repo? (with due attribs to you of course)

Also, fr's use case found a teensy edge-case bug in the script: If the CDS fragments for a 1000nt gene look like join(990..1000,1..10), we're gonna end up extracting 1..1000. I think the script might be safer if we used

print "\n",wrap("","",$feat->spliced_seq->seq),"\n";

instead of

print "\n",wrap("","",$feat->seq->seq),"\n";

What do you think?

ADD REPLY
1
Entering edit mode

Help yourself! Glad it is of use :) And yes, you're right about the multiple exons - I work with bacteria where that is rarely an issue, but definitely a good idea to call spliced_seq if you're gonna make the script more generalised. Thanks for pointing that out.

ADD REPLY
0
Entering edit mode

Thank you! I was thinking this would be an issue in bacterial circular sequences, no? From what I gather, that's where the representation might contain a join(R1,R2) with ranges R1 and R2 where R1.start > R2.start

Like so:

Eukaryotes: join(100..135, 167..209)

Circular: join(1345..1466,12..39)

Would love to hear your thoughts on this!

ADD REPLY
1
Entering edit mode

You mean if a CDS spanned the very first base of a circular genome? Such that, for a 1 kb gene half of which spans the origin of a 5 Mb genome, you might have join(4999500..5000000, 1..500)? The first base is usually at (or very near to) the origin of replication, which is usually upstream of the dnaA gene. So if the genome has been annotated correctly I don't think you would expect a CDS spanning the terminus/origin of the circular molecule. You do get the odd 'join' terminology in bacterial genomes, but they are usually in cases where frameshifts (either real or error) have meant that the annotated gene needs to be stitched together from two open reading frames rather than just one. There are things that do act a bit like introns in bacteria though, as discussed here. Apologies if that's not what you meant.

ADD REPLY
0
Entering edit mode

That is almost what I was talking about. Thank you for the info - it helps in understanding bacterial genomes better.

I was not thinking of genes that span the ends, actually. I was thinking more along the lines of what if the join(1..X, M..N) where 1<X<M<N is written in the order join(M..N,1..X) instead, but I guess that would just be a broken format that would never be an actual usage scenario.

ADD REPLY
0
Entering edit mode
9.6 years ago
rwn ▴ 590

Go to the NCBI homepage, and search for 'Acidobacterium capsulatum ATCC 51196' across all databases. Click on the results under 'Nucleotide' (should be 12 results). Number 1 is 'Acidobacterium capsulatum ATCC 51196, complete genome'. Click it and you'll see the GenBank main entry for this genome. Up in the top right corner is a drop down menu entitled 'Send'. Click Send > Coding sequences > FASTA nucleotide > Create file. You should end up with a fasta file containing the nucleotide sequences for all CDSs in the genome :)

ADD COMMENT
0
Entering edit mode

This retrieves the full genome, which is not what I need. Also, is there any way to do this from the FTP? I'm only asking this because I'm doing this for ~1200 organisms

ADD REPLY
1
Entering edit mode

Ah! Sorry. I'm not sure about that via FTP. You could download the full GenBank files for all your organisms and then extract the CDS from them. I've got a Perl script that does exactly that, you could modify it to iterate across all of your *.gbk files. Let me know if you'd like it -- uses BioPerl's Feature-Annotation methods.

ADD REPLY
0
Entering edit mode

Looking into this! Thanks!

Edit: I don't think I'll be able to do anything from the Genbank (or Refseq). This because the directory that I was using (the /all) has information from both Genbank and Refseq, and they do not hold this information that I was looking for (I think). Oh boy...

ADD REPLY
0
Entering edit mode

Not quite sure what you mean but the full genbank file is in the /all directory: GCA_000022565.1_ASM2256v1_genomic.gbff.gz. So you'd need to unpack it then extract the CDS from it using the perl script. But I guess the trickier bit is acquiring the ~1,200 genbank files without going crazy...

ADD REPLY
0
Entering edit mode

Ah! Ok, just to make sure: you mean that I should look in the .gbff file and then, for each CDS there I would extract the gene sequence to which it would correspond from the full genome? For the first one that reads

FEATURES             Location/Qualifiers
     source          1..4127356
                     /organism="Acidobacterium capsulatum ATCC 51196"
                     /mol_type="genomic DNA"
                     /strain="ATCC 51196"
                     /culture_collection="ATCC:51196"
                     /db_xref="taxon:240015"
                     /note="type strain of Acidobacterium capsulatum"
     gene            complement(117..764) # this line
                     /locus_tag="ACP_0001"
     CDS             complement(117..764)
                     /locus_tag="ACP_0001"
                     /note="YkgG family protein AMBIGUITY; this gene contains a
                     premature stop which is not the result of sequencing
                     error; identified by match to protein family HMM PF02589"
                     /codon_start=1
                     /transl_table=11
                     /product="YkgG family protein"
                     /protein_id="ACO32057.1"
                     /db_xref="GI:225791967"
                     /translation="MTQSEARAEILARIQSAVKGNDDRYEPEDSPANYRRAGTLTLPE
                     RLDLFLERLREYDAQVMSVKAGDVASTIAGVLREADEAALPWVVADGFPAEWLARSPA
                     VMPESEAGVDELDRCAGVVTACSVGIAVTGTLVLRHGAGEGRRRTSLIPDRHLCVVRA
                     SQVVETVPEAFARLADAATEPLTLISGPSATADIEMTRIRGVHGPRQLYVTLVLE"

This would mean that I would extract the nucleotides 117 to 764 from the .fna.gz file?

I hope this makes sense, and I'm sorry if I'm being confusing

ADD REPLY
1
Entering edit mode

You're on the right lines, except there should be no need to use the .fna file if you use 'full' genbank files, ie. that have the full genome sequence at the end of the file (under the heading 'Source'). For example, the file GCA_000022565.1_ASM2256v1_genomic.gbff is a full genbank file since it contains the full genome sequence at the end of it. Of course, for the gene you highlighted, the corresponding gene sequence is the reverse complement of the bases from 117 to 764 inclusive, but using bioperl's genbank parsers will keep you right.

ADD REPLY
0
Entering edit mode

Very well! Thank you so much! I'll look into this now!

(to be honest, I'd expect to find such a file directly at NCBI, but this would just be too easy right? :) ;)

I don't know if you'd like to edit your answer so that I can accept it?

Thank you once again

ADD REPLY
1
Entering edit mode

Well I'm not sure it does answer your original question to be honest! But I'll post my script for extracting CDS from genbank files below and you can have a look at it. Good luck!

ADD REPLY
0
Entering edit mode

The code works wonderfully! Now I'm working on getting it to do this for all ~1100 genomes that I'm analizing. Thank you so much for this!

Now, I just want to understand everything it is happening in here. I went to read on CPAN about Bio::Seq, and I have a couple doubts. I hope I'm not being a pain in the ass by asking these [simple] questions, and please feel free to simply post a link where I can read about these. Any way, thank you so much for your help

  1. Just like it says in CPAN, you said that I needed to find the "reverse complement" sequences. Why "reverse"? I was thinking that the gbk files simply had the protein sequences, and at the bottom the complementary genome sequence (introns and exons) that would directly correspond to those proteins;
  2. What is the difference between a .gbk file, and a .gff (the genebank flat file)? I am using the .gff files.

Thank you again!

ADD REPLY
0
Entering edit mode

Glad it's helpful :)

  1. The nucleotide information at the end of a full genbank file is the full, 'raw' genome sequence from which all the features (eg., CDSs etc) listed above it are derived (hence the label 'Source'). The source nucleotide sequence data itself is of course the textual representation of only one strand of the actual DNA molecule that has been sequenced - open reading frames may be found on either strand. Annotation algorithms obviously account for this and simply look for genes on both the sequence you provide and it's reverse complement, since transcription on the opposite strand will proceed in the opposite direction. But SeqIO should account for this - so when it comes across a "complement(gene_start..gene_end)" it will print the reverse complement of that segment of source DNA, and output all the CDSs in the same 5' to 3' orientation. (I think that is correct anyway.)

  2. Well, .gbk files and .gff files are quite different - however I think you are probably using a *.gbff (confusing I know!). gbk stands for genbank, gff stands for general feature format and finally gbff I assume is simply genbank flat file (or perhaps full file? I don't know...). Thus gbk and gbff are probably the same thing, although sometimes gbk files are just summary files, with no data on CDS or source sequences. Gff is another format altogether, used to summarise the locations of features (genes, repeats, chromosomes... defined as anything you want) within a larger source sequence.

There are any number of file formats used in the field now, each with their own pros and cons and uses. It just takes a while to get familiar with them all!

ADD REPLY
0
Entering edit mode

Thanks! Everything made sense! And you are absolutely right, I'm using .gbff (standing for genbank flat file according to the documentation on NCBI). Thank you, you were very helpful!

ADD REPLY
0
Entering edit mode

Sorry to come back to this. I am trying to understand how Bio::SeqIO (and Bio::Seq) work and how these modules get all that they get from the file. I've been dumping parts of your code in your answer above and I can see that these "just" build hashes of hashes and of arrays throughout. However, I could only see how this was done after looking at the dumps, not at the documentation on CPAN.

1. Is there any advice that you could give me to better understand how these modules build these HoH?

2. I can see from that documentation shows how you can use the keys and values to get the information. But, for instance, I don't understand how doing $feat->seq->seq gets the sequence, even though I see in the dump of $feat that you have a 'seq'=> 'ACGACAGA...' key|value, what I see in the dump is:

$Var1= \bless( { ... , 
    '_gsf_seq' => bless ( { ... ,
        'seq'=> 'ACGACAGA...' } ) 
} )

So, though the code works, shouldn't that instruction be $feat->gsf_seq->seq instead?

3. How is it that the module goes and gets the sequence from the bottom? I can see it is doing it, but I would expect this to have to be coded outside of the module (i.e. by us) and not directly in the module.

4. So, if I wanted to output the protein seq, it's just $feat -> seq -> translation.

Again, I'm sorry about the simplicity of these questions.

ADD REPLY
0
Entering edit mode

I think you are starting to get into the realm of 'not-so-simple' questions now! I can't claim to be an expert in the inner workings of the Bio::Seq modules I'm afraid, so you may be asking the wrong person.

  1. The only advice I have is to continue with what you appear to be doing already - that is, poking around in the guts of the modules and trying to figure out where/how everything joins up. If you are really stuck/desperate to know something, I would email the authors of the modules themselves (usually given on the CPAN help pages).
  2. In the case of $feat->seq->seq, my understanding is that the seq within $feat is a Bio::Seq object, thus needs to be translated into a string of nucleotides via a call to the seq method. So, the first seq is a key within $feat, the second is a call to the seq sub routine that turns it into a string. I don't know about _gsf_seq, I imagine it is a subroutine within Bio::Seq that is called internally during the process. It's usually possible to trace back where these things come from if you look for _gsf_seq within the relevant *.pm files.
  3. The joy of BioPerl modules! Someone else has coded it up, so we don't have to. I I think it's a simple case of splicing the source seq at the bottom of the gbk for each CDS. I wouldn't worry about it ;)
  4. You could do that, although a better way would be to pull the protein seqs directly from the genbank file itself - note they are all there under the \translation tag value, so to get AA seqs modify line #42 thus:

    print "\n",wrap("","",$feat->get_tag_values('translation')),"\n"; ## get aa seqs
    

    although you would probably want to modify the fasta headers a bit.

ADD REPLY
0
Entering edit mode

Thanks for all the input! Regarding your last comment, I had first tried what I suggested (and failed) and then did as you suggested. It seems like getting the translated sequences really requires that get_tag_values bit.

ADD REPLY
0
Entering edit mode

Coming back to this thread after a few problems - that are not related to what you answered. I have downloaded and processed many .gbff with success, extracting features, CDS and products. However, many of them are lacking the genome (as I have posted in here). Do you see any way to retrieve this besides what is suggested in the Perlmonks' page?

ADD REPLY
0
Entering edit mode
8.2 years ago
BalderNGS • 0

OK, a bit late, but why not use the emboss tools (sudo apt-get install emboss on ubuntu)

http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/extractfeat.html

example: extractfeat -sequence <your genbank> -type gene -outseq <your fastafile> -describe locus_tag

will extract all gene nucleotide sequences (1,6sec for your Acidobacterium capsulatum strain)

ADD COMMENT

Login before adding your answer.

Traffic: 2998 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6