How to identify single-copy genes across multiple complete genomes?
1
0
Entering edit mode
3.1 years ago

A month ago, I asked a question on how to detect paralog sequences in target enrichment when single-copy genes are NOT known: Paralog detection after target capture - HybPiper. A user replied that a de novo assembler such as SPAdes (which is used in HybPiper; a pipeline to extract target sequences from raw reads) would potentially collapse paralogs.

In order to face this problem, someone else suggested me to retrieve complete genomes that are similar enough to my non-model species and build a list of single-copy genes (there are less than 10 available genomes that I could use). Then assume that those genes are also single-copy in my species of interest.

So the question is: how to identify single-copy genes across multiple complete genomes?

target capture hybpiper spades single-copy genes • 1.9k views
1
Entering edit mode

BUSCO provides a list of universal single copy orthologs. Being "universal" it might miss some specific to the clade. But as suggested by @lieven.sterck makes sense to do create the sets using available tools. To that (eggnog-mapper is also a good tool) . BUSCO ortholog sets could be used to finetune the params (maybe).

2
Entering edit mode
3.1 years ago

Simplest approach will likely be to run something like OrthoFinder and parse the result file (or it might even be given as output by the program)

That will run a blast of all your proteomes, do protein clustering based on the blast results and give you gene fams.

if you want to be more strict you might run several of those tools (orthomcl, inparanoid, ...) and get the consensus list.

0
Entering edit mode

Cool, so I will have to filter the Orthogroups (= gene families) and keep single-copy genes only. Seems that someone already did a program for that: https://github.com/davidemms/OrthoFinder/issues/72. But how to match the sequences extracted with HybPiper to the list of single-copy genes? I guess using BLAST (but which e-value threshold?).

1
Entering edit mode

Exactly. I think that with some awk or other linux tools you will also get there.

Yes, then I would use blast indeed. Personally I would take a semi-lenient eval threshold and then filter on eg. HSPmatch length and/or %identity