Getting A Vcf File From A Fasta Alignment
2
4
Entering edit mode
8.6 years ago
Bioch'Ti ★ 1.1k

Hi All,

I'm wondering if anybody is aware of a tool that converts a fasta alignment (that includes fasta sequences of different individuals) into a vcf file? I know that the opposite is feasible (vcf to fasta) using the GATK dedicated tool, but I want to perform the opposite !

Thanks for your help, C.

vcf fasta conversion alignment • 27k views
ADD COMMENT
0
Entering edit mode

what's your input ? a CLUSTALW aln file ?

ADD REPLY
0
Entering edit mode

The input file is a txt file (.fas for example) that contains only the aligned sequences (two lines per individual : one of comment, the second one for the sequence itself). Here is an example:

>Ind1
ACGTGGCTAGATCA
>Ind2
ACGTGGCTAGATCA
>Ind3
ACGTGCCTAGATCA
ADD REPLY
0
Entering edit mode

how are managed the indels ?

ADD REPLY
0
Entering edit mode

It will be nice to manage the indels too to make the script more general. But in my case, I do not need to consider these markers.

ADD REPLY
0
Entering edit mode

I added a support to MSA/Fasta. See below.

ADD REPLY
9
Entering edit mode
8.6 years ago

I quickly wrote something for a CLUSTALW alignment (not fasta). See: http://lindenb.github.io/jvarkit/MsaToVcf.html

It seems to work with the following clustalw input:

$ curl https://raw.github.com/biopython/biopython/master/Tests/Clustalw/opuntia.aln

CLUSTAL W (1.81) multiple sequence alignment


gi|6273285|gb|AF191659.1|AF191      TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273284|gb|AF191658.1|AF191      TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273287|gb|AF191661.1|AF191      TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273286|gb|AF191660.1|AF191      TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273290|gb|AF191664.1|AF191      TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273289|gb|AF191663.1|AF191      TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273291|gb|AF191665.1|AF191      TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
                                    ******* **** *************************************

gi|6273285|gb|AF191659.1|AF191      TATATA----------ATATATTTCAAATTTCCTTATATACCCAAATATA
gi|6273284|gb|AF191658.1|AF191      TATATATA--------ATATATTTCAAATTTCCTTATATACCCAAATATA
gi|6273287|gb|AF191661.1|AF191      TATATA----------ATATATTTCAAATTTCCTTATATATCCAAATATA
gi|6273286|gb|AF191660.1|AF191      TATATA----------ATATATTTATAATTTCCTTATATATCCAAATATA
gi|6273290|gb|AF191664.1|AF191      TATATATATA------ATATATTTCAAATTCCCTTATATATCCAAATATA
gi|6273289|gb|AF191663.1|AF191      TATATATATA------ATATATTTCAAATTCCCTTATATATCCAAATATA
gi|6273291|gb|AF191665.1|AF191      TATATATATATATATAATATATTTCAAATTCCCTTATATATCCAAATATA
                                    ******          ********  **** ********* *********

gi|6273285|gb|AF191659.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCCATTGATTTAGTGT
gi|6273284|gb|AF191658.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273287|gb|AF191661.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273286|gb|AF191660.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273290|gb|AF191664.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273289|gb|AF191663.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTAT
gi|6273291|gb|AF191665.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
                                    ************************************ *********** *

gi|6273285|gb|AF191659.1|AF191      ACCAGA
gi|6273284|gb|AF191658.1|AF191      ACCAGA
gi|6273287|gb|AF191661.1|AF191      ACCAGA
gi|6273286|gb|AF191660.1|AF191      ACCAGA
gi|6273290|gb|AF191664.1|AF191      ACCAGA
gi|6273289|gb|AF191663.1|AF191      ACCAGA
gi|6273291|gb|AF191665.1|AF191      ACCAGA
                                    ******



$ curl https://raw.github.com/biopython/biopython/master/Tests/Clustalw/opuntia.aln" |\
  java -jar dist/biostar94573.jar

##fileformat=VCFv4.1
##Biostar94573CmdLine=
##Biostar94573Version=ca765415946f3ed0827af0773128178bc6aa2f62
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth.">
##contig=<ID=chrUn,length=156>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  gi|6273284|gb|AF191658.1|AF191  gi|6273285|gb|AF191659.1|AF191  gi|6273286|gb|AF191660.1|AF191  gi|6273287|gb|AF191661.1|AF191  gi|6273289|gb|AF191663.1|AF191  gi|6273290|gb|AF191664.1|AF191  gi|6273291|gb|AF191665.1|AF191
chrUn   8   .   T   A   .   .   DP=7    GT:DP   0:1 0:1 1:1 0:1 0:1 0:1 0:1
chrUn   13  .   A   G   .   .   DP=7    GT:DP   0:1 0:1 0:1 0:1 1:1 1:1 1:1
chrUn   56  .   ATATATATATA ATA,A,ATATA .   .   DP=7    GT:DP   1:1 2:1 2:1 2:1 3:1 3:1 0:1
chrUn   74  .   TCA TAT .   .   DP=7    GT:DP   0:1 0:1 1:1 0:1 0:1 0:1 0:1
chrUn   81  .   T   C   .   .   DP=7    GT:DP   0:1 0:1 0:1 0:1 1:1 1:1 1:1
chrUn   91  .   T   C   .   .   DP=7    GT:DP   1:1 1:1 0:1 0:1 0:1 0:1 0:1
chrUn   137 .   T   C   .   .   DP=7    GT:DP   0:1 1:1 0:1 0:1 0:1 0:1 0:1
chrUn   149 .   G   A   .   .   DP=7    GT:DP   0:1 0:1 0:1 0:1 1:1 0:1 0:1

Edit

I added a support for fasta/MSA

$ cat input.msa 
>Ind1
ACGTGGCTAGATCA
>Ind2
ACGTGGCTAGATCA
>Ind3
ACGTGCCTAGATCA

get the output:

$ cat input.msa | java -jar dist/biostar94573.jar 
##fileformat=VCFv4.1
##Biostar94573CmdLine=
##Biostar94573Version=4094eaed3dd2c9364309c20f88c6f79ad54a7450
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth.">
##contig=<ID=chrUn,length=14>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Ind1    Ind2    Ind3
chrUn   6   .   G   C   .   .   DP=3    GT:DP   0:1 0:1 1
ADD COMMENT
0
Entering edit mode

Hi Pierre,

I am looking for a program for converting fasta -> vcf and I found this discussion and your code very helpful. However, I have a couple of questions regarding this output -

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Ind1    Ind2    Ind3

chrUn 6 . G C . . DP=3 GT:DP 0:1 0:1 1

Shouldn't the genotype for Ind1 = 0/0 and Ind2 = 0/0 and Ind3 = 1/1? Also does this program allow input as a consensus sequence - for instance, there could be sites such Y and R in the sequence and the vcf could translate these as a A/T or C/G?

ADD REPLY
0
Entering edit mode

you're right:I've fixed the bug! thank you

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Ind1    Ind2    Ind3 chrUn   6       .       G       C       .       .       DP=3    GT:DP   0/0:1   0/0:1   1/1:1

consensus sequence: no(I don't have time for this)

ADD REPLY
0
Entering edit mode

Thanks! This is very helpful.

ADD REPLY
0
Entering edit mode

Hi Pierre, I have many vcf files now, generated courtesy of your script. I only need SNP counts for the purposes of my analyses. Is there a way to get SNP counts (only) for all my vcf files? I thought to use grep, but I noticed I was getting reports with indels, too.

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode
[filter.c:1298 filters_init1] Error: the tag "INFO/snp" is not defined in the VCF header

error in vcf file generated by your script, when using bcftools filter with TYPE="snp".

ADD REPLY
0
Entering edit mode

https://vcftools.github.io/man_latest.html

vcftools --vcf input_file.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only

I finally fix it!

ADD REPLY
0
Entering edit mode
hrUn    14005    .    TCG    TCA,TAG ##actually 2 snp
chrUn    28819    .    TTCA    TGAA,TTCC ##actually 3snp

Could you separate them, not together?

ADD REPLY
0
Entering edit mode

This can be done with other tools - you want to decompose complex variants into simple variants. I've never tried, but I think e.g. bcftools norm could be used for this job. I think also vcflib should have some tool for this.

ADD REPLY
0
Entering edit mode

Pierre, many thanks for providing this script, works very well. I used this script for one fasta alignment, but I would like to use it for thousands of fasta alignments. Is there a way to do this separately for each locus all at once?

ADD REPLY
0
Entering edit mode

just concatenate

  • the files with cat?
  • or merge the vcf later?
ADD REPLY
0
Entering edit mode

Right, but I think that would be violating homology, no? Desired workflow: fasta > vcf > snp count (for each locus/alignment). Many thanks in advance.

ADD REPLY
0
Entering edit mode

Apologies if I am not making any sense. Either way, thank you for providing script.

ADD REPLY
0
Entering edit mode

Should I use a for loop?

ADD REPLY
0
Entering edit mode

Hi Pierre,

Thanks for this! Would it be possible to output EVERY position on the reference as VCF format? (no coverage = ./., and if same as reference = 0/0)? Otherwise I can't merge VCF files of different samples together....

ADD REPLY
0
Entering edit mode

ask this as an issue at https://github.com/lindenb/jvarkit/issues . Anyway, I away from my sources for a few days.

ADD REPLY
0
Entering edit mode

I think this should work fine with bcftools merge. as long as the VCFs are all called against the same reference.

ADD REPLY
0
Entering edit mode

I try to run this: cat merged_fasta.aln | java -jar dist/biostar94573.jar but it gives me this: Unable to access jarfile dist/biostar94573.jar

ADD REPLY
0
Entering edit mode

Do you have the program from github in the path you specified (i.e. dist/biostar94573.jar)? Github link is in the first post..

ADD REPLY
0
Entering edit mode

Do you have the program from github in the path you specified

the updated doc is http://lindenb.github.io/jvarkit/MsaToVcf.html

ADD REPLY
0
Entering edit mode

And it says deprecated in favour of snp-sites from Sanger-pathogens ;)

Btw, I have a slightly different problem that doesn't work well with the two tools - I have one fasta genome (not alignment) that I need to convert to VCF for merging with my other VCF dataset. Do you know any tool that can convert fasta of one sample to VCF? (It's a chimp mapped & called against hg19, I need it as outgroup in my hg19-based dataset.)

ADD REPLY
1
Entering edit mode

hard to say without see+understand the input data. You should ask this as a new question with a link to the current question explaining why those tools don't fulfill your needs.

ADD REPLY
2
Entering edit mode
8.6 years ago
Bioch'Ti ★ 1.1k

Hi,

I also found that PGDSpider (http://www.cmpg.unibe.ch/software/PGDSpider/) is able to convert fasta alignment into a vcf file. The problem is that it does not handle concatened alignements in a single fasta file (eg sequences of different sizes).

hope this helps Best, C.

ADD COMMENT
0
Entering edit mode

I was looking into this tool in the past and found I would have to probably merge the fasta into a single long sequence. I could do that, but it sounded insane, so I've found a different solution to my problem at the time (don't remember now what it was). I guess the same holds for my current problem :D :/

ADD REPLY

Login before adding your answer.

Traffic: 1192 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6