Question: Searching For Proteins In The New Oyster Genome
cdsouthan (1.8k) wrote:

I wanted to check the new Crassostrea gigas genome for similarities to a small number of proteins whose evolution I am interested in. This proved difficult, as Crassostrea comes up as an invalid organism selection for TBLASTN vs WGS and the SRA offers only BLASTN.

So is there a way to search for ORF similarities on a small scale, or do I have to wait until the assembly gets into UCSC and/or Ensembl?

genome • 2.5k views
written 7.2 years ago by cdsouthan (1.8k)

You could translate the genome in all 6 reading frames and use this as a local database for BLAST?

written 7.2 years ago by Whetting (1.5k)

No! Use tblastx or tblastn instead. Download the draft assembly, if available, and build your own nucleotide database from it. Then use local BLAST.

written 7.2 years ago by Michael Dondrup (47k)

duh...sorry, I must have been sleeping this morning!!

written 7.2 years ago by Whetting (1.5k)

Michael Dondrup (47k), Bergen, Norway, wrote:

The GenBank accession according to the Nature paper (Zhang et al.) is AFTI01000000. While the link in the online paper doesn't work, searching NCBI Nucleotide (http://www.ncbi.nlm.nih.gov/nuccore?term=AFTI01000000) reveals 7659 entries; download all of them, e.g. as a FASTA file, and create a local BLAST database. Then run tblastx or tblastn to search for the genes. In addition, you can use available EST sequences and make a BLAST database from those. Following Ketil, I wouldn't use de novo gene predictions: the quality of predictions made without a proper training set or manual curation will not justify the effort (it will predict anything but the genes you are looking for).
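
A minimal sketch of the local workflow described above, assuming NCBI BLAST+ is installed and on the PATH. The file and database names used in the usage note (`AFTI01.fasta`, `my_proteins.faa`, `oyster_wgs`) and the E-value cutoff are placeholders, not values from this thread. Note that tblastn takes protein queries, whereas tblastx expects nucleotide queries, so tblastn is the one to use when starting from protein sequences.

```python
import shutil
import subprocess


def makeblastdb_cmd(fasta, db_name, dbtype="nucl"):
    """Build the BLAST+ command that indexes a FASTA file as a local database."""
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", db_name]


def tblastn_cmd(query_faa, db_name, out_tsv, evalue=1e-5):
    """Build a tblastn command: protein queries vs. a translated nucleotide db."""
    return [
        "tblastn",
        "-query", query_faa,
        "-db", db_name,
        "-evalue", str(evalue),  # placeholder cutoff, adjust to taste
        "-outfmt", "6",          # tabular output, one hit per line
        "-out", out_tsv,
    ]


def run_if_available(cmd):
    """Run cmd only when its executable is installed; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True)
```

With the contigs saved as `AFTI01.fasta`, calling `run_if_available(makeblastdb_cmd("AFTI01.fasta", "oyster_wgs"))` and then `run_if_available(tblastn_cmd("my_proteins.faa", "oyster_wgs", "hits.tsv"))` would build the database and write the hits to `hits.tsv`.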

modified 7.2 years ago • written 7.2 years ago by Michael Dondrup (47k)

What I was getting at before, and this is just MHO, is that a broken link to a bunch of contigs in GenBank is really not acceptable for a paper in Nature. I guess I expect more when it comes to genome papers: actual data, annotations, and the bioinformatic algorithms and pipelines needed to validate the results presented. One immense disappointment for me was the lack of a program or algorithm in the Iverson et al. paper in Science this last spring. Here's a great blog post from Titus Brown about anecdotal science, both from the Iverson paper and beyond. I kinda feel like the oyster paper is falling into this category of not providing public resources to advance science.

modified 7.2 years ago • written 7.2 years ago by Josh Herr (5.7k)

Glad you liked the blog (I was afraid posting the link was going to get this "closed"). I won't reiterate comments on the general patchiness of genome finishing and on assemblies languishing for years without updates, but in this case you'd have thought BGI would have really wanted to pull this one all the way through to a good set of public ORFs as a SOAPdenovo showcase.

modified 7.2 years ago • written 7.2 years ago by cdsouthan (1.8k)

Thanks for all the answers. Josh hits the nail on the head: Nature (and the referees) should have mandated the deposition of the genome, at the very least as blastable contigs. We should not be expected to do this unfinished job. Let's hope it finds its way into Ensembl eventually.

modified 7.2 years ago • written 7.2 years ago by cdsouthan (1.8k)

The authors did submit the contig sequences to GenBank; that's why there is an accession number. That is sufficient to make them searchable. If or when there will be a public genome annotation, I don't know.

written 7.2 years ago by Michael Dondrup (47k)

This is a partial apology in the sense that either I mistyped or it's been fixed, but I can now find Crassostrea as a taxonomic selection in WGS for TBLASTN. It then took a few minutes to search with P56817, hit contig AFTI01022267 at e-23, and run GENSCAN, which got the 520 aa ORF right off the bat. I hope I can consequently be forgiven for answering my own question, but these contigs had been on hold since 2011. The AFTI01000000 number was dead-linked in the paper because it links to a nested set of 7659 contigs, and the relationship of these to the SRA links is unclear. It would still have been better if they had asked the NCBI to put it through Gnomon to get the XP ORFs into the database, and/or the Nature editors should have said, "good stuff but assemble it please" (see comments at http://cdsouthan.blogspot.se/2012/09/the-pearl-of-oyster-genome-is-missing.html).
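
For anyone repeating this with a local BLAST rather than the NCBI web page, picking out the best contig per query from tabular output (`-outfmt 6`) can be sketched as below. The column order is the BLAST+ default for that format. The hit lines in the example are invented for illustration (hypothetical query and contig names) and are not real results from the oyster contigs.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {query: row}, keeping the highest-bitscore hit for each query."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, fields))
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Invented example lines, for illustration only.
EXAMPLE = (
    "queryA\tcontig_1\t45.0\t200\t100\t5\t10\t209\t3000\t3599\t1e-23\t110\n"
    "queryA\tcontig_2\t30.0\t80\t50\t3\t15\t94\t120\t359\t2e-04\t40.5\n"
)

hits = best_hits(io.StringIO(EXAMPLE))
print(hits["queryA"]["sseqid"], hits["queryA"]["evalue"])
```

The subject coordinates (`sstart`, `send`) of the best hit then tell you which stretch of the contig to cut out and hand to a gene finder such as GENSCAN.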

modified 7.2 years ago • written 7.2 years ago by cdsouthan (1.8k)

By the way, nice blog!

written 7.2 years ago by Josh Herr (5.7k)

Josh Herr (5.7k), University of Nebraska, wrote:

I agree with Michael above.

Something that you can do relatively quickly (I think) would be gene prediction using something like AUGUSTUS. You'll need to supply a training set or genome, but I have heard people have had good luck using the human genome for metazoans. From that output you'll have a list of putative proteins which you can BLAST against. You might not find what you are looking for this way, but it's one strategy, and it would proceed pretty quickly depending on the size of the draft genome.

I recently predicted genes on a freshly sequenced de novo genome assembly with no real close relative and it worked well and was a swift analysis.

modified 7.2 years ago • written 7.2 years ago by Josh Herr (5.7k)

Thanks, I've done exactly this before using GENSCAN, but you need the contigs to tblastn against in the first place to lock down a few partial ORF matches (I can't even find a draft assembly). I don't have the wherewithal or inclination to pipeline the whole data set just for the half dozen or so proteins I am interested in, so I guess I'll just have to wait and see if and when NCBI/UCSC/Ensembl do this.

modified 7.2 years ago • written 7.2 years ago by cdsouthan (1.8k)

I'm absolutely sorry, I thought you had a draft genome already.

I saw the paper that was just published, but I guess I was under the assumption that when there is a paper in Nature, the authors actually release the draft genome sequence to the public. I looked and can't find anything except an old EST database that doesn't even look like it's up and running anymore.

modified 7.2 years ago • written 7.2 years ago by Josh Herr (5.7k)

If you're looking for some particular proteins, I would go with tblastx as above, rather than trying to predict genes de novo. The latter will IMO require a very reliable genome as well as RNAseq or EST evidence, and even then, you will have ambiguities about gene boundaries and missing exons, etc.

written 7.2 years ago by Ketil (4.0k)

sr320 (0) wrote:

The genome and proteome are available directly from GigaScience.

http://gigadb.org/pacific_oyster/

Zhang, G; Fang, X; Guo, X; Li, L; Luo, R; Xu, F; Yang, P; Zhang, L; Wang, X; Qi, H; Zhu, Y; Yang, L; Huang, Z (2012): Genomic data from the Pacific oyster (Crassostrea gigas). GigaScience. http://dx.doi.org/10.5524/100030

The protein file (FASTA) is oyster.v9.glean.final.rename.gff.pep.gz. You could download this and do a local blastp.
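
A rough sketch of that local blastp route, assuming BLAST+ is installed and that the GigaDB file has already been downloaded to the working directory. The query file name `my_proteins.faa`, the database name `oyster_pep` and the E-value cutoff are placeholders. The archive is decompressed first on the assumption that makeblastdb wants plain FASTA input.

```python
import gzip
import shutil


def gunzip(src_gz, dest):
    """Decompress a .gz file to dest and return the dest path."""
    with gzip.open(src_gz, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def protein_db_cmd(fasta, db_name):
    """Build the BLAST+ command that indexes a protein FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", db_name]


def blastp_cmd(query_faa, db_name, out_tsv, evalue=1e-5):
    """Build a blastp command: protein queries vs. a protein database."""
    return [
        "blastp",
        "-query", query_faa,
        "-db", db_name,
        "-evalue", str(evalue),  # placeholder cutoff
        "-outfmt", "6",          # tabular output
        "-out", out_tsv,
    ]
```

After `gunzip("oyster.v9.glean.final.rename.gff.pep.gz", "oyster.pep.fasta")`, the two command lists can be passed to `subprocess.run(..., check=True)` in turn: first `protein_db_cmd("oyster.pep.fasta", "oyster_pep")`, then `blastp_cmd("my_proteins.faa", "oyster_pep", "hits.tsv")`.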


You can also blast the proteins online at http://oysterdb.cn/blast.html

modified 7.2 years ago • written 7.2 years ago by sr320 (0)

Aha, thanks. It would have been nice to have had these links in the paper. Just for the record, we agree 100% (or probably both used GENSCAN) on your protein OYG_10007802. Are you in discussions about an eventual Ensembl inclusion?

written 7.2 years ago by cdsouthan (1.8k)