Incorrect / Unusual Entries In Main Databases (GenBank, UniProt, PDB)?
12.7 years ago

Dear Biostar people,

As part of a bioinformatics course, I show some features of the major biological databases (GenBank, UniProt, the PDB...). I also advise my students to be cautious with the data they find in these databases. To illustrate this, I found some quite unusual entries in GenBank:

entry Z71230

FEATURES             Location/Qualifiers
     source          1..124
                     /organism="Nicotiana tabacum"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /isolate="Cuban cahibo cigar, gift from President Fidel
                     Castro"
                     /db_xref="taxon:4097"

entry NC_001610

FEATURES             Location/Qualifiers
     source          1..17084
                     /organism="Didelphis virginiana"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /isolate="fresh road killed individual"
                     /db_xref="taxon:9267"
                     /tissue_type="liver"
                     /dev_stage="adult"

The previous examples are clearly not wrong (since the other features seem correct), but they are quite unusual in a scientific context. However, the following entry from the PDB (7GPB, chain D, residue 67) is clearly wrong: the tryptophan is completely kinked.

So my question is: do you know other examples of such unusual or wrong entries in GenBank, UniProt, the PDB...?

Thanks.

genbank pdb quality • 6.3k views

Fun question. I'm sure all databases are riddled with questionable entries. My question would be: how can we best identify them using automated methods?

For structures in the PDB, there is the PDBREPORT database (swift.cmbi.ru.nl/gv/pdbreport/), which does the job but still requires human expertise to decide whether or not a structure is "OK".


Just occurred to me that this question should probably be set to community wiki.


Dear BioStar community,

Thank you all for your interesting answers and comments.


You all officially: Made my day. :D

12.7 years ago
Mary 11k

My favorite bizarre database item was a PubMed one. This was long before the NCBI ROFL blog was created. I was searching for genes identified in the transition to gray hair. This was not useful....

http://www.ncbi.nlm.nih.gov/pubmed/12079806

This is the TITLE (note, not the abstract):

I am a 64-year-old man, and I've always been proud of my perfect health record. I've also been proud of my full head of hair, even after the gray started creeping in. Four months ago I caught pneumonia and spent eight days in the hospital (three in intensive care). It took a while, but I'm finally back to normal - except that my hair is falling out. It comes out in clumps when I shampoo or even comb it, and it's gotten noticeably thin all over. I remember reading about Propecia in your newsletter but I don't have the old issue. Should I try the medication?


Brilliant. I always forget, there are some odd magazines in PubMed.

12.7 years ago

"Organism" is: "Tyrannosaurus rex"

http://www.ncbi.nlm.nih.gov/protein/160332318

Edit:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2781113/ : "Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies"

http://www.ncbi.nlm.nih.gov/protein/17987990

Fig 6: "The sequence annotated in GenBank as a mandelate racemase (gi|17987990, yellow dot) clusters with fuconate dehydratases (red cluster) suggesting that it should be annotated as a fuconate dehydratase instead of as a mandelate racemase."

EDIT2:

The following sequence ("gene 7 {3' end, 5' end, segment 7} [human rotavirus, strain Wa, Genomic RNA, 425 nt 2 segments]") consists only of 'N' characters:

http://www.ncbi.nlm.nih.gov/nuccore/252544?fmt_mask=65536
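
Records like this one could in principle be flagged automatically. Below is a minimal Python sketch, assuming the sequence has already been extracted as a plain string; the function names and the 0.9 threshold are illustrative assumptions, not part of any NCBI tool.

```python
def n_fraction(sequence: str) -> float:
    """Return the fraction of characters that are 'N' (case-insensitive)."""
    # Drop whitespace and line breaks, as found in FASTA/GenBank sequence blocks.
    seq = "".join(sequence.split()).upper()
    if not seq:
        return 0.0
    return seq.count("N") / len(seq)


def is_uninformative(sequence: str, threshold: float = 0.9) -> bool:
    """Flag a sequence whose N content reaches the (hypothetical) threshold."""
    return n_fraction(sequence) >= threshold
```

For example, `is_uninformative("NNNNNNNN")` is true, while `is_uninformative("ACGTACGT")` is false. A real screen would also need to decide how to treat other ambiguity codes.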

12.7 years ago
Neilfws 49k

Lots of ribosomal RNAs are misannotated as protein-coding. This seems to happen because draft sequences have been checked for potential ORFs (sometimes using nothing more than blastx), while rRNA identification is not part of the annotation pipeline.

There's a recent publication on this topic. I also blogged about it a few years ago; in that post I mention the problem of sequences misannotated as pseudogenes because of in-frame stop codons that in fact code for "non-standard" amino acids.
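
To illustrate the in-frame stop codon issue, here is a minimal Python sketch that lists internal stop codons in a coding sequence and notes which of them are known to be recoded in some organisms (UGA as selenocysteine, UAG as pyrrolysine). It assumes the standard genetic code and a sequence that starts in frame; the function name and the returned tuple layout are illustrative assumptions. A hit is only a hint that the pseudogene call deserves a second look, not evidence of recoding.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Stop codons that can encode a "non-standard" amino acid in some organisms.
RECODED = {"TGA": "selenocysteine", "TAG": "pyrrolysine"}


def internal_stops(cds: str):
    """Return (codon_index, codon, possible_amino_acid) for each in-frame
    stop codon before the final codon of `cds`."""
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA input
    usable = len(seq) - len(seq) % 3     # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    hits = []
    # The last codon is skipped: a stop there is the expected terminator.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            hits.append((index, codon, RECODED.get(codon)))
    return hits
```

With the toy input `"ATGTGAGGCTAA"`, the function reports the TGA at codon index 1 as a possible selenocysteine codon and ignores the terminal TAA.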

Also worth reading:

Estimating the annotation error rate of curated GO database sequence annotations

12.7 years ago
Bill ▴ 20

GenBank has "human" DNA sequences which appear to have come from mycoplasma mold: http://arxiv.org/abs/1106.4192

Bill

12.7 years ago
Nick Loman ▴ 610

The "diarrheal toxin" SA0276 in Staphylococcus aureus subsp. aureus N315 and other genomes is a classic misannotation (http://www.ncbi.nlm.nih.gov/protein/NP_373522.1). It is in fact an FtsK/SpoIIIE-domain-containing protein, part of a Type VII secretion system. The misannotation is discussed here (http://www.pnas.org/content/102/4/1169.full). We use this as part of a bioinformatics practical to teach students about annotation (http://www.infection.bham.ac.uk/Teaching/Bioinformatics_BSc/Intro/Bioinfo_intro_pract.doc).

12.7 years ago
Hst ▴ 20

PDBWiki was created for the particular purpose of tracking "unusual" entries in the PDB. Have a look at the example annotations page. Not all of the annotations listed there are errors, but you will find some good examples of why students should be cautious when using biological databases.

The reference is:

PDBWiki: added value through community annotation of the Protein Data Bank

12.7 years ago
Alex ★ 1.5k

Mary mentioned the title "I am a 64-year-old man, and I've always been proud of my perfect health record..." But the author of this title has 226 other articles: http://www.ncbi.nlm.nih.gov/pubmed?term=%22Simon%20HB%22%5BAuthor%5D =)


And he solved the puzzle of aging: I am a healthy, active 39-year-old guy. ;)

It seems to me that Simon HB just answers questions asked by various people.
