Incorrect / Unusual Entries In Main Databases (Genbank, Uniprot, Pdb)?
7
12
10.3 years ago

Dear Biostar people,

As part of a bioinformatics course, I show some features of the major biological databases (GenBank, UniProt, the PDB...). I also advise my students to be cautious with the data they find in these databases. To illustrate this, I found some quite unusual entries in GenBank:

entry Z71230

FEATURES             Location/Qualifiers
     source          1..124
                     /organism="Nicotiana tabacum"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /isolate="Cuban cahibo cigar, gift from President Fidel
                     Castro"
                     /db_xref="taxon:4097"


entry NC_001610

FEATURES             Location/Qualifiers
     source          1..17084
                     /organism="Didelphis virginiana"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9267"
                     /tissue_type="liver"


The previous examples are clearly not wrong (since the other features seem correct) but are quite unusual in a scientific context. However, the following entry from the PDB (7GPB, chain D, residue 67) is clearly wrong, with a completely kinked tryptophan:

So my question is: do you know of other examples of such unusual or wrong entries in GenBank, UniProt, the PDB...?

Thanks.

genbank pdb quality • 5.0k views
1

Fun question. I'm sure all databases are riddled with questionable entries. My question would be: how can we best identify them using automated methods?
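As one very simple illustration of an automated check, records whose sequence consists mostly of the ambiguity code 'N' can be flagged for manual review. The following is only a minimal sketch: the function names, the 0.5 threshold, and the toy records are assumptions made for illustration and are not real database entries or an established pipeline.

```python
def n_fraction(seq: str) -> float:
    """Return the fraction of characters in seq that are 'N' (case-insensitive)."""
    seq = seq.strip().upper()
    if not seq:
        return 0.0
    return seq.count("N") / len(seq)


def flag_mostly_n(records: dict, threshold: float = 0.5) -> list:
    """Return the IDs whose sequence has an N fraction at or above threshold."""
    return [rid for rid, seq in records.items() if n_fraction(seq) >= threshold]


# Made-up records for illustration only (hypothetical IDs and sequences).
records = {
    "toy_ok": "ATGGCCATTGTAATGGGCCGC",
    "toy_all_n": "NNNNNNNNNNNN",
    "toy_half_n": "ACGTNNNN",
}

flagged = flag_mostly_n(records)
print(flagged)
```

A check like this only catches one narrow kind of oddity; anything it flags still needs a human to decide whether the entry is actually a problem.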

0

For structures in the PDB, there is the PDBREPORT database (swift.cmbi.ru.nl/gv/pdbreport/), which does the job but still requires human expertise to decide whether or not a structure is "OK".

0

Just occurred to me that this question should probably be set to community wiki.

0

Dear BioStar community,

0

You all officially: Made my day. :D

8
10.3 years ago
Mary 11k

My favorite bizarre database item was a PubMed one. This was long before the NCBI ROFL blog was created. I was searching for genes identified in the transition to gray hair. This was not useful....

This is the TITLE (note, not the abstract):

I am a 64-year-old man, and I've always been proud of my perfect health record. I've also been proud of my full head of hair, even after the gray started creeping in. Four months ago I caught pneumonia and spent eight days in the hospital (three in intensive care). It took a while, but I'm finally back to normal - except that my hair is falling out. It comes out in clumps when I shampoo or even comb it, and it's gotten noticeably thin all over. I remember reading about Propecia in your newsletter but I don't have the old issue. Should I try the medication?

0

Brilliant. I always forget, there are some odd magazines in PubMed.

3
10.3 years ago

"Organism" is: "Tyrannosaurus rex"

Edit:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2781113/ : "Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies"

http://www.ncbi.nlm.nih.gov/protein/17987990

Fig 6: "The sequence annotated in GenBank as a mandelate racemase (gi|17987990, yellow dot) clusters with fuconate dehydratases (red cluster) suggesting that it should be annotated as a fuconate dehydratase instead of as a mandelate racemase."

EDIT2:

The following sequence ("gene 7 {3' end, 5' end, segment 7} [human rotavirus, strain Wa, Genomic RNA, 425 nt 2 segments]") only contains some 'N':

2
10.3 years ago
Neilfws 49k

Lots of ribosomal RNAs are misannotated as protein-coding. This seems to happen because draft sequence has been checked for potential ORFs (sometimes using nothing more than blastx), while rRNA identification is not part of the annotation pipeline.

There's a recent publication on this topic. I also blogged about it a few years ago, where I also mention the problem of misannotation as pseudogenes due to in-frame stop codons, which in fact code for "non-standard" amino acids.

Estimating the annotation error rate of curated GO database sequence annotations
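The in-frame stop codon point above can be illustrated with a small check that lists stop codons occurring before the final codon of a coding sequence. This is a minimal sketch under stated assumptions: the function name and the toy sequence are invented for illustration, and the standard stop codons (TAA, TAG, TGA) are assumed. Since TGA can encode selenocysteine and TAG pyrrolysine in some contexts, a hit is a reason to look more closely, not proof of a pseudogene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stops(cds: str) -> list:
    """Return (codon_index, codon) for each in-frame stop before the final codon."""
    cds = cds.upper().replace("U", "T")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The last codon is expected to be the normal terminator, so skip it.
    return [(i, c) for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]


# Made-up coding sequence with one internal TGA (hypothetical example).
toy_cds = "ATGGCCTGAGGTTAA"
hits = internal_stops(toy_cds)
print(hits)
```

Codon indices are zero-based, so an index of 2 refers to the third codon of the sequence.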

2
10.3 years ago
Bill ▴ 20

GenBank has "human" DNA sequences which appear to have come from mycoplasma mold (http://arxiv.org/abs/1106.4192). Bill

2
10.3 years ago
Nick Loman ▴ 610

The "diarrheal toxin" SA0276 in Staph aureus subsp. aureus N315 and other genomes is a classic misannotation (http://www.ncbi.nlm.nih.gov/protein/NP_373522.1). It is in fact an FtsK/SpoIII-domain containing protein, part of a Type VII secretion system. The misannotation is discussed here (http://www.pnas.org/content/102/4/1169.full). We use this as part of a bioinformatics practical to teach students about annotation (http://www.infection.bham.ac.uk/Teaching/Bioinformatics_BSc/Intro/Bioinfo_intro_pract.doc).

2
10.3 years ago
Hst ▴ 20

PDBWiki was created for the particular purpose of tracking "unusual" entries in the PDB. Have a look at the example annotations page. Not all of the annotations listed there are errors but you will find some good examples for why students should be cautious when using biological databases.

The reference is:

PDBWiki: added value through community annotation of the Protein Data Bank

1
10.3 years ago
Alex ★ 1.5k

Mary mentioned the title "I am a 64-year-old man, and I've always been proud of my perfect health record..." But the author of this title has 226 other articles: http://www.ncbi.nlm.nih.gov/pubmed?term=%22Simon%20HB%22%5BAuthor%5D =)

0

And he solved the puzzle of aging: I am a healthy, active 39-year-old guy. ;)

It seems to me that Simon HB just answers questions sent in by various people.