Question: General guidance with project about gene duplication
17 months ago, dynev.aw0 wrote:

First post here, so apologies in advance for any formatting issues and other slips.

I am currently working on a project based on a plausible hypothesis, but I feel that I am missing some fundamentals in this area.

My goal is to determine (and later compare) the number of gene duplicates in the genomes of several mammals, for genes that encode proteins homologous to certain human proteins.

I have designed the following pipeline:

  1. Download human protein FASTA files from UniProt.
  2. Download genomes from NCBI and build BLAST databases from them with makeblastdb.
  3. Run tblastn (e-value threshold = 0.0001) with my set of proteins against all genomes.
  4. Filter the BLAST output for hits that meet these criteria:
    • Query coverage > 70%
    • Distance between consecutive HSPs is < 50000 bp and not below 0 (I have accounted for the frame sign)
    • Same domains as in the original protein (optional, not yet implemented)
  5. Build a table with the resulting numbers of gene duplicates.
  6. Compare the numbers and prove/disprove the original hypothesis.
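
The filtering in step 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the poster's actual script: it assumes tblastn was run with a custom tabular format such as `-outfmt "6 qseqid sseqid qstart qend sstart send sframe qlen"`, and every function name here is hypothetical. HSPs are grouped by query, subject sequence and strand, chained when the gap to the previous HSP is between 0 and 50000 bp, and a chain counts as one candidate locus if it covers more than 70% of the query. The order of HSPs along the query and the domain check are not handled.

```python
from collections import defaultdict

# Assumed column order (an assumption, not taken from the original post):
#   qseqid sseqid qstart qend sstart send sframe qlen
MAX_GAP = 50_000      # maximum distance between consecutive HSPs, in bp
MIN_COVERAGE = 0.70   # minimum fraction of the query covered by one chain


def parse_hsps(lines):
    """Group HSPs by (query, subject sequence, strand)."""
    groups = defaultdict(list)
    qlens = {}
    for line in lines:
        if not line.strip():
            continue
        qid, sid, qs, qe, ss, se, frame, qlen = line.rstrip("\n").split("\t")
        strand = "+" if int(frame) > 0 else "-"
        # On the minus strand sstart > send, so normalise to (low, high).
        s_lo, s_hi = sorted((int(ss), int(se)))
        groups[(qid, sid, strand)].append((s_lo, s_hi, int(qs), int(qe)))
        qlens[qid] = int(qlen)
    return groups, qlens


def chain_hsps(hsps):
    """Split HSPs, sorted by subject position, into chains (candidate loci).

    An HSP joins the current chain only if 0 <= gap < MAX_GAP; an
    overlapping HSP (negative gap) starts a new chain, as in the stated rule.
    """
    chains = []
    for hsp in sorted(hsps):
        if chains:
            gap = hsp[0] - chains[-1][-1][1]
            if 0 <= gap < MAX_GAP:
                chains[-1].append(hsp)
                continue
        chains.append([hsp])
    return chains


def query_coverage(chain, qlen):
    """Fraction of query residues covered by the union of HSP query ranges."""
    covered = 0
    last_end = 0
    for q_lo, q_hi in sorted((h[2], h[3]) for h in chain):
        q_lo = max(q_lo, last_end + 1)  # skip residues already counted
        if q_hi >= q_lo:
            covered += q_hi - q_lo + 1
            last_end = q_hi
    return covered / qlen


def count_loci(lines):
    """Count chains per query that pass the coverage threshold."""
    groups, qlens = parse_hsps(lines)
    counts = defaultdict(int)
    for (qid, _sid, _strand), hsps in groups.items():
        for chain in chain_hsps(hsps):
            if query_coverage(chain, qlens[qid]) > MIN_COVERAGE:
                counts[qid] += 1
    return dict(counts)


# Toy input: two HSPs on chr1 that chain together, one long minus-strand
# hit on chr2, and one short hit on chr3 that fails the coverage filter.
example = [
    "P1\tchr1\t1\t60\t1000\t1180\t1\t100",
    "P1\tchr1\t55\t100\t5000\t5138\t2\t100",
    "P1\tchr2\t1\t90\t900\t631\t-1\t100",
    "P1\tchr3\t1\t30\t200\t290\t3\t100",
]
print(count_loci(example))  # {'P1': 2}
```

Running `count_loci` once per genome would give one column of the table in step 5. Whether each chain is a real gene copy rather than a pseudogene or fragment is exactly what the criteria still need to be tested for.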

However, I am not completely sure about some of these steps, so here are my questions:

  1. Is it correct to count gene duplication events using this method (tblastn...)?
  2. If so, are the criteria I selected appropriate?
  3. How can I test my criteria? Are there databases of gene duplication counts, at least for human?
  4. (I should have asked this first, but well) Are there standard methods for doing this more efficiently and in a more scientifically sound way?

Any other information on the general theory behind the subject is appreciated, as is criticism.



This is a bit outside my daily routine, but I came across the UCSC RetroGenes track some time ago.

In summary, they align all known mRNAs against the genome and take a closer look at those that have at least two distinct alignments.

I hope this might help a bit.
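
To illustrate the "at least two distinct alignments" idea, here is a minimal sketch. It is not how the UCSC track is actually built; it assumes alignments are already available as hypothetical `(query, chromosome, start, end)` tuples, ignores strand, and treats overlapping alignments on the same chromosome as one locus.

```python
from collections import defaultdict


def distinct_loci(alignments):
    """Merge overlapping alignments per (query, chromosome) into loci."""
    by_key = defaultdict(list)
    for query, chrom, start, end in alignments:
        by_key[(query, chrom)].append((start, end))

    loci = defaultdict(list)
    for (query, chrom), spans in by_key.items():
        merged = []
        for start, end in sorted(spans):
            if merged and start <= merged[-1][1]:
                # Overlaps the previous span: extend that locus.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        loci[query].extend((chrom, s, e) for s, e in merged)
    return dict(loci)


def multi_locus_queries(alignments):
    """Keep only queries that align to two or more distinct loci."""
    return {q: l for q, l in distinct_loci(alignments).items() if len(l) >= 2}


# Toy data: mRNA_A has two overlapping alignments on chr1 plus one on chr7,
# mRNA_B aligns only once.
alignments = [
    ("mRNA_A", "chr1", 100, 900),
    ("mRNA_A", "chr1", 850, 1200),
    ("mRNA_A", "chr7", 5000, 5800),
    ("mRNA_B", "chr2", 10, 400),
]
print(multi_locus_queries(alignments))
# {'mRNA_A': [('chr1', 100, 1200), ('chr7', 5000, 5800)]}
```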



written 16 months ago by michael.ante

Is there any particular reason you want to work with the genomes? Why not use the (predicted) genes instead, or are they not available?

written 15 months ago by lieven.sterck