Question: Finding Information About Hypothetical Genes
8.2 years ago by
Rubal7770
Rubal7770 wrote:

Hello All,

Does anyone have any advice on how to gather information about hypothetical genes, in the sense of predicted genes in the genome that are of unknown function, e.g. LOC100363218? Is there a way to find the names of the known genes with the highest percentage of sequence identity? I am aware that this is no guarantee of similar function, but it would provide at least some speculative information.

Thanks in advance for your help!

gene function sequence genome • 2.9k views
ADD COMMENTlink modified 4.5 years ago by Biostar ♦♦ 20 • written 8.2 years ago by Rubal7770
8.2 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

You can always re-run

  • BLAST (especially blastx of the DNA sequence against NR)
  • InterProScan

The annotation as a hypothetical gene might indicate that this has already been tried and no convincing hits were found. However, the annotation may not have been updated recently, whereas the databases are updated more often, so a similar sequence may have been added to NR in the meantime (hopefully not yet another hypothetical protein).
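
Once a BLAST run has finished, the hits still need to be sorted through. The following is a minimal sketch of that step, not a definitive implementation: it assumes the search was saved as tab-separated output with the columns listed in the comment, and it drops hits whose titles are themselves uninformative. The function name `informative_hits`, the `UNINFORMATIVE` word list, the `min_length` cutoff, and the example rows are all made up for illustration.

```python
# Column order assumed to match a BLAST+ run such as:
#   blastx -query query.fasta -db nr -remote \
#     -outfmt "6 qseqid sseqid pident length evalue bitscore stitle"
COLUMNS = ["qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "stitle"]

# Title phrases treated as uninformative (an illustrative list, not a standard).
UNINFORMATIVE = ("hypothetical", "uncharacterized", "predicted protein", "unnamed protein")


def informative_hits(tabular_text, min_length=50):
    """Return hits with informative titles, highest bit score first."""
    hits = []
    for line in tabular_text.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # not in the assumed column layout
        hit = dict(zip(COLUMNS, fields))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        # Very short alignments can show high identity by chance.
        if hit["length"] < min_length:
            continue
        title = hit["stitle"].lower()
        if any(phrase in title for phrase in UNINFORMATIVE):
            continue
        hits.append(hit)
    return sorted(hits, key=lambda h: h["bitscore"], reverse=True)


# Made-up rows, for illustration only.
EXAMPLE = (
    "query1\tsubjA\t91.2\t240\t1e-120\t410\thypothetical protein [Species one]\n"
    "query1\tsubjB\t78.5\t235\t3e-95\t330\texample kinase [Species two]\n"
    "query1\tsubjC\t99.0\t20\t0.5\t35.1\texample short match [Species three]\n"
)

for hit in informative_hits(EXAMPLE):
    print(hit["sseqid"], hit["pident"], hit["stitle"])
```

Ranking by bit score rather than raw percent identity is a design choice here: a short alignment can have very high identity without being a convincing hit.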

ADD COMMENTlink written 8.2 years ago by Michael Dondrup47k

CG-Pipeline has several modules for annotation including specific modules for BLASTing to Uniprot and InterProScan. The BLAST module can be customized for another protein database such as NR.

This wouldn't be useful for just one specific protein (you should use the web interfaces if you are only performing a few queries) or if you are not familiar with Linux (in that case I'd use CloVR or RAST, if you have several queries). However, it is useful on a large scale, such as whole-genome annotation. CG-Pipeline as a whole is optimized for prokaryotes, but for just getting an idea of a gene's function, these modules should work well.

http://sourceforge.net/projects/cg-pipeline/

ADD REPLYlink modified 8.2 years ago • written 8.2 years ago by Lee Katz3.0k
8.2 years ago by
cdsouthan1.8k
cdsouthan1.8k wrote:

The use of LOC numbers, "hypothetical" and "model" can be confusing. You can see the criteria for generating LOC numbers in the Entrez Gene guide, but most of them are not proteins, and this one is labeled as a pseudogene (http://www.ncbi.nlm.nih.gov/gene/100363218). Thus human Entrez Gene has ~2x as many entries as there are protein-coding loci. Some protein records are also labeled "hypothetical" even when the ORFs are strongly supported by many mRNA reads from large-scale cDNA projects; it may just mean they have never been curated by RefSeq or Swiss-Prot. As Michael says, BLAST and InterProScan are key steps to discern whether you have any protein similarity. Perhaps you could expand on exactly what you want to do and whether you want to be gene-centric or protein-centric.

ADD COMMENTlink written 8.2 years ago by cdsouthan1.8k

Thanks for the explanation. I'm trying to find the potential functions of genes identified through whole-genome scans, such as GWAS, to understand whether the function of an identified gene would make biological sense for a candidate gene. Obviously, when LOC numbers are hypothetical or model this is more challenging.

ADD REPLYlink written 8.2 years ago by Rubal7770
8.2 years ago by
Ashwin110
India
Ashwin110 wrote:

One more thing you can do is get the location information for all hypothetical genes; Ensembl BioMart has an interface to get all overlapping genes. Ensembl is known to have more annotated genes than RefSeq. The solution is fully trivial and may give you false positives, but it's worth trying.
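
The overlap idea can be sketched in a few lines once the coordinates have been exported. This is only an illustration under stated assumptions: the `Feature` type, the `overlapping` function, and all names and coordinates below are hypothetical, coordinates are taken to be 1-based and inclusive, and strand is ignored, which is one way false positives can arise.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlapping(query, annotations):
    """Return annotations on the same chromosome whose span overlaps the query."""
    return [
        f for f in annotations
        if f.chrom == query.chrom and f.start <= query.end and query.start <= f.end
    ]


# Made-up coordinates, for illustration only.
loc = Feature("LOC_example", "chr1", 1000, 5000)
annotated = [
    Feature("geneA", "chr1", 4500, 9000),  # overlaps the end of the locus
    Feature("geneB", "chr1", 6000, 7000),  # same chromosome, no overlap
    Feature("geneC", "chr2", 1000, 5000),  # same span, different chromosome
]

for feature in overlapping(loc, annotated):
    print(feature.name)
```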

ADD COMMENTlink written 8.2 years ago by Ashwin110
8.2 years ago by
cdsouthan1.8k
cdsouthan1.8k wrote:

I would suggest that your functional/mechanistic follow-ups of GWAS results should be hypothesis-neutral. Most of the associations scored for marker SNPs and/or haplotype blocks will not be located within gene loci anyway, or may act remotely in cis/trans even if they are. Most GWAS results are reported gene-centrically because these are just the genomic signposts we happen to know about. You will have to start bottom-up, with conserved patches being one of the key starting points.

ADD COMMENTlink modified 8.2 years ago • written 8.2 years ago by cdsouthan1.8k

I completely agree. We are currently looking for changes in conserved positions. However, once a locus is identified, we believe it is still worth understanding the function of hypothetical genes in these regions, in order to generate new hypotheses that can then be tested.

ADD REPLYlink written 8.2 years ago by Rubal7770