Generate FASTA/MSA from a VCF file
5 weeks ago

I have been trying to generate FASTA sequences for a region using a multi-sample VCF file and a reference genome. The VCF contains 70 diploid individuals, and what I ultimately want is 140 sequences, two for each sample. Output in a multiple sequence alignment format would also work for me. Are there any scripts or tools that can do this?

I have tried FastaAlternateReferenceMaker (from GATK), but it only gives me one consensus sequence for all the samples.

Any help will be greatly appreciated!

VCF MSA FASTA • 272 views

Extract a VCF for each sample with a loop and bcftools view -s ${SAMPLENAME} ....
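The loop suggested above could be sketched roughly as follows. This is only an illustration: the file name cohort.vcf.gz, the sample names, and the helper per_sample_commands are hypothetical, and the sketch only builds the bcftools view command lines rather than running them.

```python
import shlex


def per_sample_commands(vcf_path, samples):
    """Build one `bcftools view -s SAMPLE` command per sample.

    vcf_path and the output naming scheme are assumptions made for
    this sketch; adjust them to your own files.
    """
    commands = []
    for sample in samples:
        out_path = f"{sample}.vcf.gz"  # hypothetical output name
        commands.append(
            ["bcftools", "view", "-s", sample, "-O", "z", "-o", out_path, vcf_path]
        )
    return commands


# Hypothetical input file and sample names.
cmds = per_sample_commands("cohort.vcf.gz", ["sample01", "sample02"])
for cmd in cmds:
    # Print the shell-quoted command; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

The sample names themselves can be listed with bcftools query -l on the multi-sample VCF.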


Thanks for the reply. I understand that I could iterate the process over single-sample VCFs, but do you have any suggestions for generating two FASTA sequences per sample from a VCF?

For example, the VCF:

chr1: 2 A  T 1/0 SNP1
chr1: 5 T  G 0/1 SNP2

The sequence for 1:1-10 in the reference genome:
AAAATCCCC

What I want:
>seq1
ATAATCCCC
>seq2
AAAAGCCCC
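The example above can be reproduced with a minimal Python sketch. It assumes biallelic SNPs only and treats the genotype as phased (first allele goes to the first sequence, second allele to the second), which an unphased "/" genotype does not actually guarantee. The function name apply_snps and the tuple layout are made up for illustration.

```python
def apply_snps(reference, snps, start=1):
    """Return two haplotype sequences for one diploid sample.

    reference: reference bases for the region (string)
    snps: iterable of (pos, ref, alt, genotype), with 1-based pos
    start: 1-based coordinate of the first base in `reference`

    Assumption: genotypes are treated as phased, so the first allele
    is applied to haplotype 1 and the second to haplotype 2.
    """
    hap1 = list(reference)
    hap2 = list(reference)
    for pos, ref, alt, gt in snps:
        idx = pos - start
        if reference[idx] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        # Accept both "/" and "|" separators; "." (missing) stays reference.
        a1, a2 = gt.replace("/", "|").split("|")
        if a1 == "1":
            hap1[idx] = alt
        if a2 == "1":
            hap2[idx] = alt
    return "".join(hap1), "".join(hap2)


reference = "AAAATCCCC"
snps = [(2, "A", "T", "1/0"), (5, "T", "G", "0/1")]
seq1, seq2 = apply_snps(reference, snps)
print(">seq1\n" + seq1)
print(">seq2\n" + seq2)
```

For real data, indels and multi-allelic sites need more care than this sketch gives them. With a phased VCF, bcftools consensus with its -s (sample) and -H (haplotype 1 or 2) options is one existing tool that may be worth checking for this purpose.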
5 weeks ago
Ared445 ▴ 40

I've been using Ensembl's Variant Effect Predictor (VEP) to generate protein FASTA files from RNA and exome data. It gives a sequence for every variant in the VCF and categorizes them nicely. Really wonderful tool overall.

I haven't looked into it, but since it generates protein FASTA so easily, I imagine it could also give you the nucleotide FASTA (?). Could be worth a look.


Thank you so much! I will definitely give it a try!
