Retrieve specific fasta sequences from a group of assemblies
7 weeks ago
SushiRoll ▴ 100

Hi all,

Sorry if this question has been addressed before, but I haven't been able to find a solution. I have a lot of assemblies (around 800) and I would like to retrieve the FASTA sequence of a specific housekeeping gene which should (in theory) be present in all of them. Is there a tool that takes a FASTA assembly as input and retrieves a specific gene while tolerating a certain % of sequence variation, so that the gene is found even if it has mutations? Alternatively, it could take a GBK or GFF3 file as input and use the gene annotation as the retrieval criterion.

Thanks a lot!

CDS gene sequence • 414 views

I don't know if there is a ready-made tool. You will need to align the gene to your assemblies, then parse the alignment results and retrieve the sequence you need with samtools faidx or a similar tool.
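A minimal sketch of the parsing step, assuming the alignment was done with BLAST and written in tabular format (-outfmt 6, default 12 columns). That choice of aligner is my assumption, not something fixed by the suggestion above. The function names (best_hit, faidx_region, extract) and the 80% identity cutoff are made up for illustration. The sketch picks the highest-scoring hit, builds a contig:start-end region string of the kind samtools faidx accepts, and also shows a pure-Python extraction that reverse-complements minus-strand hits.

```python
# Sketch: choose the best BLAST hit for a gene in one assembly and extract it.
# Assumes BLAST tabular output (-outfmt 6) with the default 12 columns:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def best_hit(blast_lines, min_identity=80.0):
    """Return the highest-bitscore hit with at least min_identity % identity, or None."""
    best = None
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        pident = float(fields[2])
        sstart, send = int(fields[8]), int(fields[9])
        bitscore = float(fields[11])
        if pident < min_identity:
            continue
        if best is None or bitscore > best["bitscore"]:
            best = {
                "contig": fields[1],
                # For nucleotide hits on the minus strand, sstart > send.
                "start": min(sstart, send),
                "end": max(sstart, send),
                "strand": "+" if sstart <= send else "-",
                "pident": pident,
                "bitscore": bitscore,
            }
    return best


def faidx_region(hit):
    """Format a hit as a 1-based, inclusive region string (contig:start-end)."""
    return f"{hit['contig']}:{hit['start']}-{hit['end']}"


def extract(contigs, hit):
    """Slice the hit out of a {contig_name: sequence} dict, in gene orientation."""
    seq = contigs[hit["contig"]][hit["start"] - 1:hit["end"]]
    if hit["strand"] == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
    return seq


if __name__ == "__main__":
    # Toy data standing in for one assembly and its BLAST results.
    contigs = {"contig1": "AAACCCGGGTTTACGT"}
    lines = [
        "gene\tcontig1\t95.0\t6\t0\t0\t1\t6\t4\t9\t1e-5\t30.0",
        "gene\tcontig1\t90.0\t7\t0\t0\t1\t7\t16\t10\t1e-10\t50.0",
    ]
    hit = best_hit(lines)
    print(faidx_region(hit), hit["strand"], extract(contigs, hit))
```

Note that this only returns the aligned portion of the best single hit. If the gene is split across contigs or the alignment is partial, you would need to check the alignment length against the query length before trusting the extracted sequence.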


Great, I'll give it a shot.

Thanks!


Personally, I have never worked on a similar task, so unfortunately I can't offer a polished solution, but what you are trying to do here is find orthologous genes. Searching with that keyword should turn up tools suited to the task; OrthoFinder, for example, showed up in a quick search.


Thanks Matthias, that's a great starting point, I'll check what's out there.