Question: How to identify single-copy genes across multiple complete genomes?

A month ago, I asked how to detect paralogous sequences in target-enrichment data when single-copy genes are NOT known: Paralog detection after target capture - HybPiper. A user replied that a de novo assembler such as SPAdes (which is used by HybPiper, a pipeline that extracts target sequences from raw reads) could potentially collapse paralogs.

To address this problem, someone else suggested that I retrieve complete genomes that are similar enough to my non-model species and build a list of single-copy genes from them (there are fewer than 10 available genomes that I could use). I would then assume that those genes are also single-copy in my species of interest.

So the question is: how to identify single-copy genes across multiple complete genomes?

written 13 months ago by Santiago Montero-Mendieta

BUSCO provides lists of universal single-copy orthologs. Being "universal", they might miss some genes that are specific to your clade. So, as suggested by @lieven.sterck, it makes sense to create the sets yourself using available tools (eggnog-mapper is also a good tool for that). The BUSCO ortholog sets could perhaps be used to fine-tune the parameters.

written 13 months ago by microfuge

lieven.sterck (VIB, Ghent, Belgium) wrote:

The simplest approach will likely be to run something like OrthoFinder and parse the result file (the list might even be given directly as output by the program).

It will run an all-vs-all BLAST of your proteomes, cluster the proteins based on the BLAST results, and give you gene families.
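As a rough illustration of the "parse the result file" step, here is a minimal Python sketch. It assumes a tab-separated per-species gene count table (one row per orthogroup, one column per species, optionally a final "Total" column), similar to what OrthoFinder writes as `Orthogroups.GeneCount.tsv`. The function name and the example data are made up for illustration, so check the column layout of your own output first.

```python
import csv
import io


def single_copy_orthogroups(handle):
    """Return IDs of orthogroups that have exactly one gene in every species."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Assumption: the first column holds the orthogroup ID and every other
    # column except an optional "Total" column is a per-species gene count.
    id_col = reader.fieldnames[0]
    species = [col for col in reader.fieldnames[1:] if col != "Total"]
    keep = []
    for row in reader:
        if all(int(row[sp]) == 1 for sp in species):
            keep.append(row[id_col])
    return keep


# Tiny made-up table with three species.
example = (
    "Orthogroup\tspA\tspB\tspC\tTotal\n"
    "OG0000000\t3\t1\t2\t6\n"
    "OG0000001\t1\t1\t1\t3\n"
    "OG0000002\t1\t0\t1\t2\n"
)
print(single_copy_orthogroups(io.StringIO(example)))  # ['OG0000001']
```

Note that this strict filter also drops families that are missing from one genome (OG0000002 above). With fewer than 10 genomes that is probably what you want, but it can be relaxed if the list becomes too short.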

If you want to be stricter, you might run several of those tools (OrthoMCL, InParanoid, ...) and take the consensus list.
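One possible way to build such a consensus, sketched in Python: group IDs differ between tools, so the groups are compared by their gene membership instead. The sketch assumes you have already parsed each tool's single-copy groups into lists of gene IDs; the function name and the gene IDs are hypothetical.

```python
def consensus_single_copy(*tool_results):
    """Return the groups (as frozensets of gene IDs) reported identically by all tools.

    Each tool result is an iterable of groups; each group is an iterable of gene IDs.
    """
    as_sets = [{frozenset(group) for group in result} for result in tool_results]
    if not as_sets:
        return set()
    # Strict consensus: keep a group only if every tool reports exactly the same members.
    return set.intersection(*as_sets)


# Made-up single-copy groups from two tools (gene order within a group does not matter).
orthofinder = [["spA|g1", "spB|g7", "spC|g3"], ["spA|g2", "spB|g9", "spC|g4"]]
orthomcl = [["spB|g7", "spA|g1", "spC|g3"], ["spA|g2", "spB|g9", "spC|g5"]]

consensus = consensus_single_copy(orthofinder, orthomcl)
print(sorted(sorted(group) for group in consensus))
# [['spA|g1', 'spB|g7', 'spC|g3']]
```

Requiring identical membership is the strictest definition of consensus. A looser variant could keep groups that merely overlap between tools.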

written 13 months ago by lieven.sterck

Cool, so I will have to filter the orthogroups (= gene families) and keep only the single-copy genes. It seems that someone has already written a program for that: https://github.com/davidemms/OrthoFinder/issues/72. But how do I match the sequences extracted with HybPiper to the list of single-copy genes? I guess using BLAST (but with which e-value threshold?).

written 13 months ago by Santiago Montero-Mendieta

Exactly. I think that with some awk or other Linux tools you will get there as well.

Yes, I would indeed use BLAST then. Personally, I would take a semi-lenient e-value threshold and then filter on e.g. HSP match length and/or % identity.
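A minimal Python sketch of that kind of filtering, assuming the BLAST results were written in the default 12-column tabular format (`-outfmt 6`). The thresholds (e-value 1e-5, 100 aligned positions, 80 % identity), the function name and the example hits are placeholders for illustration only, not recommended values.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-5, min_length=100, min_pident=80.0):
    """Keep (query, subject) pairs that pass all three thresholds."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    kept = []
    for hit in reader:
        if (float(hit["evalue"]) <= max_evalue
                and int(hit["length"]) >= min_length
                and float(hit["pident"]) >= min_pident):
            kept.append((hit["qseqid"], hit["sseqid"]))
    return kept


# Made-up hits: only the first passes the e-value, length and identity filters.
example = (
    "locus1\tOG0000001_spA\t92.5\t450\t30\t2\t1\t450\t10\t459\t1e-150\t520\n"
    "locus1\tOG0000007_spB\t71.0\t120\t30\t3\t5\t124\t40\t159\t2e-08\t60.1\n"
    "locus2\tOG0000002_spA\t95.0\t40\t2\t0\t1\t40\t1\t40\t3e-06\t45.0\n"
)
print(filter_hits(io.StringIO(example)))  # [('locus1', 'OG0000001_spA')]
```

A locus that still hits more than one single-copy family after filtering is probably worth inspecting by hand.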

written 13 months ago by lieven.sterck