How do I identify gene duplications or paralogs in an annotated reference genome assembly?
6 weeks ago

I generated a high-quality de novo genome assembly, which has been annotated using previously published Iso-Seq data from the species. This genome is now the RefSeq assembly for the species. I am currently writing a manuscript detailing the assembly. My PI suggested that I add a quick, biologically relevant analysis to the text to show the utility of my genome. Specifically, she said I should investigate whether any of a list of candidate genes we have for a specific trait are duplicated or expanded.

To do this, she said I simply need to BLAST the sequence of my gene of interest, as annotated in this assembly, against the whole assembly. I have looked for papers that do this so I can cite them, and in sifting through them I have only succeeded in confusing myself.

As far as I understand, the specific method to do this is the same as this method to identify paralogs using blastp (I am unsure why I would use blastp to identify duplicated genes, but the papers I am reading all seem to agree on blastp instead of blastn):

where I

1) Take the FASTA nucleotide sequence for my gene of interest (as determined by the annotation of my genome) and BLAST it (do I use blastn or blastp?), specifying the database as nr (non-redundant protein) and the organism as my species of interest.

2) Once I get my BLAST results back, my top hit will be the same gene I used as the query.

3) If any other results have a significant E-value and score, are those potential paralogs/duplicated genes? How might I verify or validate that? Or could I only say these are putative paralogs?
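The three steps above can be sketched in code. This is only a minimal sketch under stated assumptions, not an established pipeline: it assumes the search was run as blastp of the candidate proteins against a database built from the assembly's own annotated proteins (rather than nr), with tabular output (`-outfmt 6`). The file names, the function name `candidate_paralogs`, and the E-value and identity cut-offs are hypothetical placeholders. Any hits that survive are still only putative paralogs, and hits between isoforms of the same gene would need to be collapsed separately.

```python
import csv
import io

# Assumed upstream commands (file names are hypothetical placeholders):
#   makeblastdb -in annotated_proteins.faa -dbtype prot -out proteome_db
#   blastp -query candidates.faa -db proteome_db -outfmt 6 -evalue 1e-5 -out hits.tsv

# The 12 default columns of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def candidate_paralogs(handle, max_evalue=1e-10, min_pident=30.0):
    """Map each query to its non-self hits that pass the thresholds.

    Returns {query: [(subject, pident, evalue, bitscore), ...]}, sorted by
    descending bit score. The thresholds are illustrative, not recommendations.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        rec = dict(zip(FIELDS, row))
        if rec["qseqid"] == rec["sseqid"]:
            continue  # step 2: the top hit is the query itself, so skip it
        pident = float(rec["pident"])
        evalue = float(rec["evalue"])
        bits = float(rec["bitscore"])
        if evalue > max_evalue or pident < min_pident:
            continue  # step 3: keep only significant hits
        per_query = best.setdefault(rec["qseqid"], {})
        # Keep only the highest-scoring alignment per subject sequence.
        if rec["sseqid"] not in per_query or bits > per_query[rec["sseqid"]][2]:
            per_query[rec["sseqid"]] = (pident, evalue, bits)
    return {
        query: sorted(((s, *v) for s, v in subjects.items()),
                      key=lambda hit: -hit[3])
        for query, subjects in best.items()
    }


# Tiny made-up example: geneA hits itself, a strong hit (geneB), a weak hit (geneC).
example = "\n".join([
    "geneA\tgeneA\t100.0\t300\t0\t0\t1\t300\t1\t300\t0.0\t600",
    "geneA\tgeneB\t85.0\t290\t40\t2\t5\t295\t3\t292\t1e-150\t480",
    "geneA\tgeneC\t25.0\t80\t60\t1\t10\t90\t20\t100\t0.5\t30",
])
result = candidate_paralogs(io.StringIO(example))
print(result)
```

In the made-up example only geneB survives as a putative paralog of geneA. For real data the same function could be called on an open file handle, e.g. `candidate_paralogs(open("hits.tsv"))`.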

Am I missing anything? Is there a way to screen all annotated genes for duplication/paralogs that would make more sense than repeatedly BLASTing through the list? I am very lost as to how to proceed.

6 weeks ago
sansan_96 ▴ 90


I think MCScanX could help you; there is a lot of information about it and it is easy to use:
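As a rough illustration of the input preparation: as far as I understand the MCScanX documentation, it expects an all-vs-all BLAST table (`-outfmt 6`) plus a simplified four-column position file (chromosome, gene ID, start, end). The sketch below converts `gene` features from a GFF3 file into such rows. The function name `gff3_to_mcscanx`, the species prefix, and the choice of the `ID` attribute are assumptions for illustration. The gene IDs must match the sequence IDs used in the BLAST table, and chromosome names may need remapping for a real assembly.

```python
import io


def gff3_to_mcscanx(handle, prefix="xx"):
    """Collect (chromosome, gene_id, start, end) rows from GFF3 'gene' features.

    `prefix` is a hypothetical two-letter species tag prepended to the
    sequence name; adjust it to whatever naming the downstream tool expects.
    """
    rows = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # only gene-level features carry the positions we want
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        gene_id = attrs.get("ID")
        if gene_id is None:
            continue
        rows.append((f"{prefix}{cols[0]}", gene_id, int(cols[3]), int(cols[4])))
    return rows


# Made-up GFF3 snippet: two genes and one mRNA on a sequence called "1".
example_gff = (
    "##gff-version 3\n"
    "1\tRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=abc\n"
    "1\tRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna1;Parent=gene1\n"
    "1\tRefSeq\tgene\t1500\t2400\t.\t-\t.\tID=gene2\n"
)
rows = gff3_to_mcscanx(io.StringIO(example_gff))
for row in rows:
    print("\t".join(str(field) for field in row))
```

The printed tab-separated rows could then be written to the position file that sits alongside the BLAST table.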


