Question

50K genomes to gene alignment

0

6 days ago

Madde ▴ 20

Hello,

I have 50,000 genomes annotated with Prokka and one gene of interest. Is there a fast way to create an alignment of that gene across all 50,000 genomes?

My first instinct is to use PIRATE. Are there quicker alternatives?

Is there a way to grab the gene of interest from the .gff file of each of the 50,000 Prokka-annotated genomes and then align all of the copies? Any tools, scripts, or approaches for this?

Thanks.

genomics pirate roary • 6.5k views

updated 3 days ago by michael.ante ★ 4.0k • written 6 days ago by Madde ▴ 20

score 2 · Answer 1 · 2025-09-12

2

6 days ago

GenoMax 153k

If you know the location of the gene in each genome, and it sounds like you do because you have GFF files, then extracting those sequences (e.g. with bedtools getfasta) should be straightforward. It probably does not make sense to build an MSA from 50K sequences that are likely redundant to some extent, so you can make the set non-redundant by clustering and then align that smaller set using MUSCLE, MAFFT, Clustal, etc.
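A minimal sketch of the coordinate-extraction step, assuming the gene is labelled through a `gene=` attribute on CDS rows of each Prokka GFF (the function name `gff_gene_to_bed` and the example gene `dnaA` are illustrative, not part of any tool). It converts matching rows to BED so that they can be passed to bedtools getfasta (with `-s` to respect strand) together with the matching genome FASTA. Prokka may add suffixes such as `_1` to duplicated gene names, so an exact match could miss paralogs.

```python
def gff_gene_to_bed(gff_lines, gene_name):
    """Yield BED6 tuples for CDS features whose gene attribute equals gene_name."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # Prokka GFF files embed the genome sequence after this marker
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Column 9 holds key=value pairs separated by semicolons
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if attrs.get("gene") != gene_name:
            continue
        start = int(cols[3]) - 1  # GFF is 1-based inclusive, BED is 0-based half-open
        end = int(cols[4])
        yield (cols[0], start, end, attrs.get("ID", gene_name), "0", cols[6])


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python gff_to_bed.py genome1.gff dnaA > genome1.bed
    gff_path, gene = sys.argv[1], sys.argv[2]
    with open(gff_path) as handle:
        for row in gff_gene_to_bed(handle, gene):
            print("\t".join(str(value) for value in row))
```

Running this once per genome gives 50,000 small BED files, which is trivially parallel; the expensive part remains the alignment, not the extraction.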

0

Since Prokka was used, you should already have all genes in the FASTA file with the suffix .ffn. If the gene name is annotated correctly, you can extract it with e.g. fasgrep from the FAST toolkit.
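If FAST is not installed, the same kind of header filtering can be sketched in plain Python. This is not fasgrep itself, only a stand-in that keeps FASTA records whose header line matches a regular expression; `read_fasta` and `grep_fasta` are illustrative names. What the .ffn header actually contains (locus tag, product, gene name) should be checked in your own files before choosing a pattern.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def grep_fasta(lines, pattern):
    """Yield only the records whose header matches the regular expression."""
    regex = re.compile(pattern)
    for header, seq in read_fasta(lines):
        if regex.search(header):
            yield header, seq


if __name__ == "__main__":
    import sys

    # Usage (hypothetical names): python grep_ffn.py genome1.ffn "DnaA" >> gene.fasta
    ffn_path, pattern = sys.argv[1], sys.argv[2]
    with open(ffn_path) as handle:
        for header, seq in grep_fasta(handle, pattern):
            # Prefix the file name so sequences from different genomes stay distinguishable
            print(f">{ffn_path}|{header}\n{seq}")
```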

As GenoMax pointed out, an MSA of 50k sequences is tedious, so you can cluster the sequences first, e.g. with MMseqs2.
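Before running a real clustering tool, a cheap first pass is to collapse exact duplicates, which often removes a large share of the sequences for a single conserved gene. The sketch below is only that: exact, case-insensitive matching, not the similarity-based clustering that MMseqs2 performs. The name `collapse_identical` and the input format (pairs of name and sequence) are assumptions for illustration.

```python
def collapse_identical(records):
    """Group (name, sequence) pairs by exact sequence, ignoring case.

    Returns a dict mapping each unique sequence to the list of names that
    carry it, in first-seen order.
    """
    groups = {}
    for name, seq in records:
        groups.setdefault(seq.upper(), []).append(name)
    return groups


def representatives_fasta(groups):
    """Build FASTA text with one record per unique sequence.

    The first name in each group is used as the representative, and the
    group size is appended to the header.
    """
    lines = []
    for seq, names in groups.items():
        lines.append(f">{names[0]} size={len(names)}")
        lines.append(seq)
    return "\n".join(lines) + "\n"
```

Keeping the mapping from each unique sequence back to its genome names means the alignment of the representatives can later be expanded to all 50,000 genomes without realigning anything.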

3 days ago by michael.ante ★ 4.0k