Question

Finding Information About Hypothetical Genes

0

11.8 years ago

Rubal7 ▴ 830

Hello All,

Does anyone have advice on how to gather information about hypothetical genes, meaning predicted genes in the genome that are of unknown function (e.g. LOC100363218)? Is there a way to find the names of the known genes with the highest percentage of sequence identity? I am aware that this is no guarantee of similar function, but it would provide at least some speculative information.

Thanks in advance for your help!

gene function genome sequence • 4.2k views

updated 8.1 years ago by Biostar 20 • written 11.8 years ago by Rubal7 ▴ 830

score 1 · Answer 1 · 2012-06-26

11.8 years ago

Michael 54k

You can always rerun the standard similarity searches:

BLAST (especially blastx of the DNA sequence against NR)
InterProScan

The annotation as a hypothetical gene might indicate that this has already been tried and no convincing hits were found. However, the annotation may not have been updated recently, while the databases are updated more often, so a similar sequence may have been added to NR only recently (hopefully not yet another hypothetical protein).
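As a rough sketch of how the BLAST results could be screened, the snippet below parses tabular output and keeps only hits whose titles carry some functional information. It assumes the search was run with BLAST+ using `-outfmt "6 std stitle"` (the twelve standard columns plus the subject title). The function names, the E-value cutoff, the list of uninformative keywords, and the sample rows are all made up for illustration and are not real search results.

```python
# Column order for BLAST+ tabular output requested with: -outfmt "6 std stitle"
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore", "stitle"]

# Assumed keywords that mark a hit as carrying no functional information.
UNINFORMATIVE = ("hypothetical", "uncharacterized", "unnamed protein product")


def parse_hits(tabular_text):
    """Turn tab-separated BLAST output into a list of dicts."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, line.split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


def informative_hits(hits, max_evalue=1e-5):
    """Keep significant hits with a descriptive title, best bit score first."""
    kept = [
        hit for hit in hits
        if hit["evalue"] <= max_evalue
        and not any(word in hit["stitle"].lower() for word in UNINFORMATIVE)
    ]
    return sorted(kept, key=lambda hit: hit["bitscore"], reverse=True)


# Made-up rows in the expected format; not real BLAST output.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["query1", "XP_000001.1", "92.5", "200", "15", "0", "1", "600", "1", "200",
     "2e-80", "350", "hypothetical protein [Example organism]"],
    ["query1", "NP_000002.1", "78.0", "198", "43", "1", "4", "597", "3", "200",
     "1e-60", "290", "example kinase 1 [Example organism]"],
    ["query1", "NP_000003.1", "35.0", "80", "52", "2", "10", "249", "5", "84",
     "0.5", "40", "example transporter [Example organism]"],
])

if __name__ == "__main__":
    for hit in informative_hits(parse_hits(SAMPLE)):
        print(hit["sseqid"], hit["pident"], hit["stitle"])
```

Sorting by bit score rather than percent identity is a design choice here: a short, near-identical fragment can have a high identity yet be a weak match overall, so the identity is reported alongside the hit instead of being used for ranking.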

0

CG-Pipeline has several annotation modules, including specific ones for BLASTing against UniProt and for running InterProScan. The BLAST module can be customized to use another protein database such as NR.

This wouldn't be useful for just one specific protein (use the web interfaces if you are only performing a few queries) or if you are not familiar with Linux (in that case I'd use CloVR or RAST if you have several queries). However, it is useful at a large scale, such as whole-genome annotation. CG-Pipeline as a whole is optimized for prokaryotes, but for just getting an idea of a gene's function these modules should work well.

http://sourceforge.net/projects/cg-pipeline/

11.8 years ago by Lee Katz ★ 3.1k

score 1 · Answer 2 · 2012-06-26

The use of LOC numbers and of the labels "hypothetical" and "model" can be confusing. You can see the criteria for generating LOC numbers in the Entrez Gene guide, but most of them are not proteins, and this one is labeled as a pseudogene (http://www.ncbi.nlm.nih.gov/gene/100363218). Thus the number of human Entrez Gene entries is roughly twice the number of protein-coding loci. Some protein records are also labeled "hypothetical" even when the ORFs are strongly supported by many mRNA reads from large-scale cDNA projects; it may just mean they have never been curated by RefSeq or Swiss-Prot. As Michael says, BLAST and InterProScan are key steps to discern whether you have any protein similarity. Perhaps you could expand on exactly what you want to do and whether you want to be gene-centric or protein-centric.
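For checking how a LOC entry is currently labeled without opening the web page, a minimal sketch along these lines might help. It relies on the LOC symbol being "LOC" followed by the Entrez GeneID (as the URL above shows for this gene) and on the NCBI E-utilities `esummary` endpoint with `db=gene`. The helper names are hypothetical, and the layout of the returned JSON is deliberately not assumed, so inspect the response before relying on any field.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities document-summary endpoint.
EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def gene_id_from_loc(symbol):
    """Strip the LOC prefix, e.g. 'LOC100363218' -> '100363218'."""
    if not symbol.startswith("LOC") or not symbol[3:].isdigit():
        raise ValueError("not a LOC-style symbol: %r" % symbol)
    return symbol[3:]


def summary_url(gene_id):
    """Build the esummary URL for one Entrez Gene record, asking for JSON."""
    query = urlencode({"db": "gene", "id": gene_id, "retmode": "json"})
    return EUTILS_ESUMMARY + "?" + query


def fetch_summary(gene_id):
    """Download and parse the summary (needs network access)."""
    with urlopen(summary_url(gene_id)) as response:
        return json.load(response)


if __name__ == "__main__":
    gene_id = gene_id_from_loc("LOC100363218")
    print(summary_url(gene_id))
    # Uncomment to query NCBI and look at the raw record:
    # print(json.dumps(fetch_summary(gene_id), indent=2))
```

For more than a handful of genes, batching the IDs and respecting NCBI's request-rate limits would be needed, which this sketch does not attempt.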

score 1 · Answer 3 · 2012-06-26

11.8 years ago

Ashwin ▴ 110

One more thing you can do is get the location information for all the hypothetical genes; Ensembl BioMart has an interface for retrieving all overlapping genes. Ensembl is known to have more annotated genes than RefSeq. The approach is very simple and may give you false positives, but it's worth trying.
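The overlap idea can also be sketched offline, assuming the coordinates of the hypothetical genes and of the annotated genes have already been exported (for example from BioMart) as name, chromosome, start and end. The `Gene` tuple, the function names, and every coordinate below are invented for illustration.

```python
from collections import namedtuple

# Coordinates are treated as 1-based and inclusive, as in Ensembl exports.
Gene = namedtuple("Gene", "name chrom start end")


def overlaps(a, b):
    """True if the two genes share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def overlapping_annotations(query, annotated):
    """Return every annotated gene that overlaps the query gene."""
    return [gene for gene in annotated if overlaps(query, gene)]


# Made-up coordinates; not the real location of any gene.
HYPOTHETICAL = Gene("LOC_example", "5", 1000, 2000)
ANNOTATED = [
    Gene("geneA", "5", 1500, 3000),   # partial overlap
    Gene("geneB", "5", 2001, 2500),   # adjacent, no shared base
    Gene("geneC", "7", 1000, 2000),   # same coordinates, other chromosome
    Gene("geneD", "5", 1200, 1300),   # fully contained in the query
]

if __name__ == "__main__":
    for gene in overlapping_annotations(HYPOTHETICAL, ANNOTATED):
        print(gene.name, gene.chrom, gene.start, gene.end)
```

Strand is ignored here, which is one source of the false positives mentioned above: a gene on the opposite strand will still be reported as overlapping.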

score 0 · Answer 4 · 2012-07-02

11.8 years ago

cdsouthan ★ 1.9k

I would suggest that your functional/mechanistic follow-ups of GWAS results should be hypothesis-neutral. Most of the associations scored for marker SNPs and/or haplotype blocks will not be located within gene loci anyway, or may act remotely in cis or trans even if they are. Most GWAS results are reported gene-centrically because genes are just the genomic signposts we happen to know about. You will have to start bottom-up, with conserved patches being one of the key starting points.

0

I completely agree. We are currently looking for changes in conserved positions. However, once a locus is identified, we believe it is still worth understanding the function of hypothetical genes in these regions in order to generate new hypotheses that can then be tested.

11.8 years ago by Rubal7 ▴ 830