50K genomes to gene alignment
6 days ago
Madde ▴ 20

Hello,

I have 50,000 genomes annotated with Prokka and one gene of interest. Is there a fast way to create an alignment of that gene across all 50,000 genomes?

My first instinct is to use PIRATE. Are there quicker alternatives?

Is there a way to grab the gene of interest from the Prokka .gff file of each of the 50,000 genomes and then align all of the extracted sequences? Are there any tools, scripts, or approaches for this?

Thanks.

genomics pirate roary • 6.5k views
6 days ago
GenoMax 153k

If you know the location of the gene in each genome, and it sounds like you do (because you have GFF files), then extracting those sequences (e.g. with bedtools getfasta) should be straightforward. It probably does not make sense to build an MSA from 50K sequences that are likely redundant to some extent, so you can make the set non-redundant by clustering and then align that smaller set using MUSCLE, MAFFT, Clustal, etc.
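As a rough illustration of the coordinate-extraction step, here is a minimal Python sketch that scans GFF3 lines for CDS features with a given `gene=` attribute and emits BED rows that bedtools getfasta could consume. The function name `gene_to_bed`, the demo records, and the gene name `dnaA` are made up for the example, and it assumes the annotation stores the gene name in a `gene=` attribute (Prokka may add suffixes such as `_1` to paralogs, which an exact match would miss).

```python
def gene_to_bed(gff_lines, gene_name):
    """Yield BED6 rows (0-based, half-open) for CDS features whose
    'gene' attribute equals gene_name. Hypothetical helper."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded genome sequence follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Column 9 holds key=value pairs separated by ';'
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if attrs.get("gene") == gene_name:
            start, end = int(cols[3]), int(cols[4])  # GFF is 1-based, inclusive
            name = attrs.get("locus_tag", attrs.get("ID", "."))
            yield (cols[0], start - 1, end, name, ".", cols[6])


# Made-up demo input, not real annotation output
demo = [
    "##gff-version 3\n",
    "contig1\tProdigal:2.6\tCDS\t101\t400\t.\t-\t0\tID=X_00001;gene=dnaA;locus_tag=X_00001;product=DnaA\n",
    "contig1\tProdigal:2.6\tCDS\t500\t800\t.\t+\t0\tID=X_00002;locus_tag=X_00002;product=hypothetical protein\n",
    "##FASTA\n",
    ">contig1\n",
    "ACGT\n",
]
rows = list(gene_to_bed(demo, "dnaA"))
for row in rows:
    print("\t".join(str(field) for field in row))
```

The resulting BED rows keep the strand in column 6, so running bedtools getfasta with `-s` should return reverse-strand genes in coding orientation.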


Since Prokka was used, you should already have all genes in a FASTA file with the suffix .ffn. If the gene name is annotated correctly, you can use e.g. FAST's fasgrep to extract it.
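If installing a dedicated tool is not an option, the same header-based extraction can be sketched in plain Python. This is only an illustration, not how fasgrep works internally: `read_fasta` and `grep_fasta` are hypothetical helpers, the demo records are invented, and it assumes the header text contains a string that identifies the gene (in Prokka .ffn output the header is typically the locus tag followed by the product description, so the pattern may need to match the product rather than the gene symbol).

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def grep_fasta(lines, pattern):
    """Return records whose header matches the regex (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(h, s) for h, s in read_fasta(lines) if rx.search(h)]


# Made-up demo records
ffn = [
    ">X_00001 Chromosomal replication initiator protein DnaA\n",
    "ATGAAA\n",
    "CCC\n",
    ">X_00002 hypothetical protein\n",
    "ATGTTT\n",
]
hits = grep_fasta(ffn, r"\bDnaA\b")
```

For 50,000 genomes, one would loop over the .ffn files (for example with `pathlib.Path.glob`), open each one, and prefix every hit's header with the genome name so the sequences stay traceable in the final alignment.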

As GenoMax pointed out, an MSA with 50k sequences is tedious, so you can reduce the set first by clustering, e.g. with MMseqs2.
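Before running a real clustering tool, a cheap first pass is to collapse exactly identical sequences, which for a single gene across many closely related genomes can already shrink the set considerably. The sketch below does only that; it is not similarity clustering and is not a substitute for MMseqs2's identity thresholds. The function name `collapse_identical` and the demo records are made up.

```python
def collapse_identical(records):
    """Group (name, sequence) pairs by exact sequence (case-insensitive).

    Returns a list of (representative_name, sequence, member_names),
    in order of first appearance.
    """
    groups = {}
    for name, seq in records:
        groups.setdefault(seq.upper(), []).append(name)
    return [(names[0], seq, names) for seq, names in groups.items()]


# Made-up demo records: g1 and g2 carry the same allele
records = [("g1", "ATGAAA"), ("g2", "atgaaa"), ("g3", "ATGAAC")]
unique = collapse_identical(records)
for rep, seq, members in unique:
    print(rep, seq, len(members))
```

Keeping the member list for each representative makes it possible to map the aligned unique sequences back to all genomes that share them.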
