Identify SNP and indel variants from a list of FASTA sequences
9 days ago
Trinh ▴ 10

I have a FASTA file containing ~500 DNA sequences of a specific gene collected from various yeast strains. Each sequence is labeled with the corresponding strain name in the header. I would like to identify mutational variants—both SNPs and indels—across these sequences. My goal is to annotate these variants at both the cDNA and protein levels.

While identifying SNPs and translating them into protein variants seems relatively straightforward, handling indels has proven challenging, particularly when determining their correct impact on the translated peptide sequence due to potential frameshifts.

I’m currently working in Python but am also open to R-based solutions. I would appreciate any recommendations for existing tools, workflows, or published scripts/tutorials that are designed for this use case—especially those that can correctly manage indels and their effect on protein sequences.

Thank you very much for your time and help.

4 days ago
Mark ★ 1.7k

Your best bet is to output your variant calls as a VCF and then use something like SnpEff to annotate each variant with its predicted effect.
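To make the VCF step concrete, here is a minimal sketch of turning a raw call into a VCF data line. The function name `vcf_line` and the input convention (an empty string for the missing allele of an indel) are my own assumptions for illustration, not part of any tool. The one real constraint it illustrates is that VCF does not allow empty alleles, so indels are written with an anchor base. It does not left-align indels in repeats; something like `bcftools norm` is normally used for that.

```python
def vcf_line(chrom, ref_seq, pos, ref, alt):
    """Format one variant as a tab-separated VCF data line (8 fixed columns).

    Hypothetical input convention (1-based, ungapped reference coordinates):
      substitution: ref and alt have equal length, pos = first changed base
      deletion:     alt == "", pos = first deleted base
      insertion:    ref == "", bases are inserted after pos (0 = before base 1)
    """
    if ref and ref_seq[pos - 1:pos - 1 + len(ref)].upper() != ref.upper():
        raise ValueError("ref allele does not match the reference sequence")

    if ref and alt and len(ref) == len(alt):
        vpos, vref, valt = pos, ref, alt
    elif not ref and alt:
        if pos >= 1:
            # Anchor on the reference base just before the inserted bases.
            anchor = ref_seq[pos - 1]
            vpos, vref, valt = pos, anchor, anchor + alt
        else:
            # Insertion before the first base: anchor on the following base.
            anchor = ref_seq[0]
            vpos, vref, valt = 1, anchor, alt + anchor
    elif ref and not alt:
        if pos >= 2:
            # Anchor on the reference base just before the deleted bases.
            anchor = ref_seq[pos - 2]
            vpos, vref, valt = pos - 1, anchor + ref, anchor
        else:
            # Deletion at the first base: anchor on the base after it.
            anchor = ref_seq[len(ref)]
            vpos, vref, valt = 1, ref + anchor, anchor
    else:
        raise ValueError("unsupported allele combination")

    # CHROM POS ID REF ALT QUAL FILTER INFO
    return "\t".join([chrom, str(vpos), ".", vref, valt, ".", ".", "."])


reference = "ATGCCGTAA"
print(vcf_line("gene", reference, 8, "A", "G"))  # substitution
print(vcf_line("gene", reference, 6, "G", ""))   # deletion
print(vcf_line("gene", reference, 3, "", "A"))   # insertion
```

A real file also needs the `##fileformat` header and the `#CHROM` column line before SnpEff will accept it.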

Calling indels is tricky, and I don't know of any tools that perform variant calling from an MSA (which is what I assume you have).
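If you do have an MSA, a naive column-by-column comparison of each strain against one reference row is easy to write yourself. The sketch below is only that, a sketch: `call_variants` is a hypothetical helper, it assumes both rows come from the same alignment and use `-` for gaps, and it trusts the aligner's gap placement completely. Leading or trailing gaps from truncated sequences will show up as deletions, so filter those separately.

```python
def call_variants(ref_aln, sample_aln):
    """Compare two gapped rows of the same MSA.

    Returns a list of (pos, ref, alt, kind) tuples, with pos 1-based in
    ungapped reference coordinates. For insertions, pos is the reference
    base after which the extra bases occur.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("rows must come from the same alignment")

    variants = []
    ref_pos = 0  # reference bases consumed so far
    i, n = 0, len(ref_aln)
    while i < n:
        r, s = ref_aln[i], sample_aln[i]
        if r == "-" and s == "-":
            i += 1  # column is a gap in both rows, nothing to report
        elif r != "-" and s != "-":
            ref_pos += 1
            if r.upper() != s.upper():
                variants.append((ref_pos, r, s, "SNP"))
            i += 1
        elif r == "-":
            # Insertion: collect the whole run of reference gaps.
            inserted = []
            while i < n and ref_aln[i] == "-":
                if sample_aln[i] != "-":
                    inserted.append(sample_aln[i])
                i += 1
            variants.append((ref_pos, "", "".join(inserted), "INS"))
        else:
            # Deletion: collect the whole run of sample gaps.
            deleted = []
            while i < n and sample_aln[i] == "-":
                if ref_aln[i] != "-":
                    deleted.append(ref_aln[i])
                i += 1
            variants.append((ref_pos + 1, "".join(deleted), "", "DEL"))
            ref_pos += len(deleted)
    return variants


print(call_variants("ATG-CCGTAA", "ATGACC-TGA"))
```

Grouping each gap run into a single event matters for the protein-level annotation later, because the frame depends on the total length of the run and not on individual gap columns.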

You might want to check out the augur toolkit, specifically augur translate. It has some interesting Python code (using Biopython) that performs translation, which you might be able to adapt to your needs.
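For the protein-level part, the core idea is to translate the reference CDS and the strain's CDS separately and compare them, using the length difference to tell frameshifts from in-frame indels. Below is a dependency-free sketch of that idea. It is not augur's code, the `translate` and `classify` helpers are hypothetical, and it assumes the standard genetic code and sequences that start at the first codon position. In practice Biopython's `Seq.translate` can replace the hand-built table.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate up to and including the first stop codon ('*').

    Incomplete trailing codons are dropped; codons with ambiguous
    bases become 'X'.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def classify(ref_cds, alt_cds):
    """Return (effect, ref_protein, alt_protein) for one strain's CDS."""
    ref_p, alt_p = translate(ref_cds), translate(alt_cds)
    delta = len(alt_cds) - len(ref_cds)
    if delta % 3 != 0:
        effect = "frameshift"
    elif delta != 0:
        effect = "inframe_indel"
    elif ref_p == alt_p:
        effect = "synonymous"
    elif len(alt_p) < len(ref_p):
        effect = "nonsense"
    else:
        effect = "missense"
    return effect, ref_p, alt_p


ref = "ATGGCCAAATAA"
print(classify(ref, "ATGGACAAATAA"))  # single-base substitution
print(classify(ref, "ATGGCAAATAA"))   # one base deleted
```

This classifies the net change per sequence, so two compensating indels in the same strain would be reported as an in-frame event. Per-variant annotation needs each variant applied to the reference on its own.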

Having said all of the above, I think the best option is to create synthetic reads from the FASTA, align those reads to an annotated reference and call variants, then proceed with SnpEff annotation. This lets you use the standard variant-calling tools, and everything built on top of them, to predict effects.
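Creating the synthetic reads can be as simple as tiling each sequence with overlapping windows and writing them out as FASTQ with a constant quality. The sketch below does that. The helper names, the read length and step defaults, and the constant `I` quality character are all assumptions chosen for illustration. Dedicated read simulators such as wgsim or ART are an alternative if you want realistic error profiles.

```python
def tile_reads(name, seq, read_len=100, step=10):
    """Yield (read_name, read_seq) windows that cover seq end to end.

    Reads are error-free substrings; the final window is forced to
    end at the last base so the 3' end is not under-covered.
    """
    if len(seq) <= read_len:
        yield f"{name}_0", seq
        return
    last_start = len(seq) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    for start in starts:
        yield f"{name}_{start}", seq[start:start + read_len]


def fastq_record(read_name, read_seq, qual_char="I"):
    """Format one read as a four-line FASTQ record with constant quality."""
    return f"@{read_name}\n{read_seq}\n+\n{qual_char * len(read_seq)}\n"


# Toy example: one 24 bp "strain" sequence, 10 bp reads every 4 bp.
for read_name, read_seq in tile_reads("strainA", "ACGT" * 6, read_len=10, step=4):
    print(fastq_record(read_name, read_seq), end="")
```

Writing one FASTQ per strain keeps the calls separable afterwards, and since each input is a single haplotype, the caller's ploidy setting should reflect that.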

My suggestion is to follow GATK best practices for variant calling.
