Question: Incorrect / Unusual Entries In Main Databases (Genbank, Uniprot, Pdb)?
12
5.7 years ago by
France
Pierre Poulain440 wrote:

Dear Biostar people,

As part of a bioinformatics course, I show some features of the major biological databases (GenBank, UniProt, the PDB...). I also advise my students to be cautious with the data they find in these databases. To illustrate this, I found some quite unusual entries in GenBank:

entry Z71230

FEATURES             Location/Qualifiers
     source          1..124
                     /organism="Nicotiana tabacum"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /isolate="Cuban cahibo cigar, gift from President Fidel
                     Castro"
                     /db_xref="taxon:4097"

entry NC_001610

FEATURES             Location/Qualifiers
     source          1..17084
                     /organism="Didelphis virginiana"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /isolate="fresh road killed individual"
                     /db_xref="taxon:9267"
                     /tissue_type="liver"
                     /dev_stage="adult"

The previous examples are clearly not wrong (the other features seem correct), but they are quite unusual in a scientific context. However, PDB entry 7GPB is clearly wrong at chain D, residue 67: the tryptophan is completely kinked.

So my question is the following: do you know other examples of such unusual or wrong entries in GenBank, UniProt, the PDB...?

Thanks.

pdb quality genbank • 3.3k views
ADD COMMENTlink modified 3.2 years ago by Biostar ♦♦ 10 • written 5.7 years ago by Pierre Poulain440
1

Fun question. I'm sure all databases are riddled with questionable entries. My question would be: how can we best identify them using automated methods?

ADD REPLYlink written 5.7 years ago by Neilfws46k
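
One very crude way to start on the automated side is a keyword screen over the /isolate qualifier of a GenBank source feature. The sketch below is only an illustration: the watch-list `ODD_WORDS` and the function names are hypothetical, and a real screen would need a curated vocabulary and human review of every hit.

```python
import re

# Hypothetical watch-list, chosen only to match the two entries quoted above.
ODD_WORDS = ("gift", "road kill", "cigar", "president")

# Feature table of entry Z71230 as quoted in the question.
EXAMPLE = '''
     source          1..124
                     /organism="Nicotiana tabacum"
                     /isolate="Cuban cahibo cigar, gift from President Fidel
                     Castro"
                     /db_xref="taxon:4097"
'''

def source_qualifiers(feature_table):
    """Return /key="value" qualifiers as a dict, with wrapped values
    collapsed onto a single line."""
    pairs = re.findall(r'/(\w+)="([^"]*)"', feature_table)
    return {key: " ".join(value.split()) for key, value in pairs}

def flag_odd_isolate(feature_table):
    """List the watch-list words found in the /isolate qualifier."""
    isolate = source_qualifiers(feature_table).get("isolate", "").lower()
    return [word for word in ODD_WORDS if word in isolate]

print(flag_odd_isolate(EXAMPLE))
```

This only catches wording that someone already thought to list, so it finds more of the same rather than new kinds of oddity.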


For structures in the PDB, there is the PDBREPORT database (swift.cmbi.ru.nl/gv/pdbreport/), which does the job but still requires human expertise to decide whether or not a structure is "OK".

ADD REPLYlink written 5.7 years ago by Pierre Poulain440

Just occurred to me that this question should probably be set to community wiki.

ADD REPLYlink written 5.6 years ago by Neilfws46k
8
5.7 years ago by
Mary11k
Boston MA area
Mary11k wrote:

My favorite bizarre database item was a PubMed one. This was long before the NCBI ROFL blog was created. I was searching for genes identified in the transition to gray hair. This was not useful...

http://www.ncbi.nlm.nih.gov/pubmed/12079806

This is the TITLE (note, not the abstract):

I am a 64-year-old man, and I've always been proud of my perfect health record. I've also been proud of my full head of hair, even after the gray started creeping in. Four months ago I caught pneumonia and spent eight days in the hospital (three in intensive care). It took a while, but I'm finally back to normal - except that my hair is falling out. It comes out in clumps when I shampoo or even comb it, and it's gotten noticeably thin all over. I remember reading about Propecia in your newsletter but I don't have the old issue. Should I try the medication?

ADD COMMENTlink written 5.7 years ago by Mary11k

Brilliant. I always forget, there are some odd magazines in PubMed.

ADD REPLYlink written 5.7 years ago by Neilfws46k
3
5.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum91k wrote:

"Organism" is: "Tyrannosaurus rex"

http://www.ncbi.nlm.nih.gov/protein/160332318

Edit:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2781113/ : "Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies"

http://www.ncbi.nlm.nih.gov/protein/17987990

Fig 6: "The sequence annotated in GenBank as a mandelate racemase (gi|17987990, yellow dot) clusters with fuconate dehydratases (red cluster) suggesting that it should be annotated as a fuconate dehydratase instead of as a mandelate racemase."

EDIT2:

The following sequence ("gene 7 {3' end, 5' end, segment 7} [human rotavirus, strain Wa, Genomic RNA, 425 nt 2 segments]") consists only of 'N' characters:

http://www.ncbi.nlm.nih.gov/nuccore/252544?fmt_mask=65536
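
Records like this one are easy to screen for automatically. A minimal sketch, assuming the sequence is already available as a plain string; the function names and the 0.9 default threshold are arbitrary illustrations, not an established standard:

```python
def ambiguous_fraction(seq):
    """Fraction of residues that are not an unambiguous base (A, C, G, T, U)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base not in "ACGTU" for base in seq) / len(seq)

def is_uninformative(seq, threshold=0.9):
    """True when at least `threshold` of the sequence is ambiguous."""
    return ambiguous_fraction(seq) >= threshold

print(is_uninformative("N" * 425))   # a 425 nt record made only of N
print(is_uninformative("ACGTNACGT"))
```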

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by Pierre Lindenbaum91k
2
5.7 years ago by
Bill20
Bill20 wrote:

GenBank has "human" DNA sequences which appear to have come from Mycoplasma bacteria: http://arxiv.org/abs/1106.4192
Bill

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by Bill20
2
5.7 years ago by
Nick Loman600
United Kingdom
Nick Loman600 wrote:

The "diarrheal toxin" SA0276 in Staphylococcus aureus subsp. aureus N315 and other genomes is a classic misannotation (http://www.ncbi.nlm.nih.gov/protein/NP_373522.1). It is in fact an FtsK/SpoIIIE-domain-containing protein, part of a Type VII secretion system. The misannotation is discussed here (http://www.pnas.org/content/102/4/1169.full). We use this as part of a bioinformatics practical to teach students about annotation (http://www.infection.bham.ac.uk/Teaching/Bioinformatics_BSc/Intro/Bioinfo_intro_pract.doc).

ADD COMMENTlink written 5.7 years ago by Nick Loman600
2
5.6 years ago by
Hst20
Hst20 wrote:

PDBWiki was created for the particular purpose of tracking "unusual" entries in the PDB. Have a look at the example annotations page. Not all of the annotations listed there are errors, but you will find some good examples of why students should be cautious when using biological databases.

The reference is:

PDBWiki: added value through community annotation of the Protein Data Bank

ADD COMMENTlink written 5.6 years ago by Hst20
1
5.7 years ago by
Neilfws46k
Sydney, Australia
Neilfws46k wrote:

Lots of ribosomal RNAs are misannotated as protein-coding. This seems to happen because the draft sequence is scanned for potential ORFs (sometimes using nothing more than blastx), while rRNA identification is not part of the annotation pipeline.

There's a recent publication on this topic. I also blogged about it a few years ago; in that post I mention the related problem of genes misannotated as pseudogenes because of in-frame stop codons that in fact code for "non-standard" amino acids.
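
To make the in-frame stop codon point concrete, here is a minimal sketch that lists internal stop codons in a coding sequence and notes which ones are known to be recoded in some organisms (UGA as selenocysteine, UAG as pyrrolysine). The function name and return format are hypothetical, and a hit is only a candidate: the sequence context still has to be checked before calling something a pseudogene or a recoded gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Stop codons that some organisms read as a "non-standard" amino acid.
RECODED = {"TGA": "selenocysteine (Sec)", "TAG": "pyrrolysine (Pyl)"}

def internal_stops(cds):
    """Return (codon_index, codon, possible_recoding) for every in-frame
    stop codon before the final complete codon of the sequence."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    hits = []
    for index, codon in enumerate(codons[:-1]):  # last codon is the expected stop
        if codon in STOP_CODONS:
            hits.append((index, codon, RECODED.get(codon)))
    return hits

print(internal_stops("ATGTGAGGCTAA"))
```

A TAA hit comes back with `None` as the third field, since no recoding is listed for it here.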

Also worth reading:

Estimating the annotation error rate of curated GO database sequence annotations

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by Neilfws46k
1
5.7 years ago by
Alex1.4k
Theodosius Dobzhansky Center for Genome Bioinformatics
Alex1.4k wrote:

Mary mentioned the title "I am a 64-year-old man, and I've always been proud of my perfect health record..." But the author of this title has 226 other articles: http://www.ncbi.nlm.nih.gov/pubmed?term=%22Simon%20HB%22%5BAuthor%5D =)

ADD COMMENTlink written 5.7 years ago by Alex1.4k

And he solved the puzzle of aging: I am a healthy, active 39-year-old guy. ;)

It seems to me that Simon HB. just answers questions asked by several people.

ADD REPLYlink written 5.6 years ago by Cjt360
0
5.6 years ago by
France
Pierre Poulain440 wrote:

Dear BioStar community,

Thank you all for your interesting answers and comments.

ADD COMMENTlink written 5.6 years ago by Pierre Poulain440
0
5.6 years ago by
Fabian Bull1.2k
German
Fabian Bull1.2k wrote:

You all officially: Made my day. :D

ADD COMMENTlink written 5.6 years ago by Fabian Bull1.2k