Question

Getting A Vcf File From A Fasta Alignment

5

Entering edit mode

11.4 years ago

Bioch'Ti ★ 1.1k

Hi All,

I'm wondering if anybody is aware of a tool that converts a fasta alignment (that includes fasta sequences of different individuals) into a vcf file? I know that the opposite is feasible (vcf to fasta) using the GATK dedicated tool, but I want to perform the opposite !

Thanks for your help, C.

vcf fasta conversion alignment • 38k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 11.4 years ago by Bioch'Ti ★ 1.1k

0

Entering edit mode

what's your input ? a CLUSTALW aln file ?

ADD REPLY • link 11.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

The input file is a txt file (.fas for example) that contains only the aligned sequences (two lines per individual : one of comment, the second one for the sequence itself). Here is an example:

>Ind1
ACGTGGCTAGATCA
>Ind2
ACGTGGCTAGATCA
>Ind3
ACGTGCCTAGATCA

ADD REPLY • link updated 11.4 years ago by Pierre Lindenbaum 166k • written 11.4 years ago by Bioch'Ti ★ 1.1k

0

Entering edit mode

how are managed the indels ?

ADD REPLY • link 11.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

It will be nice to manage the indels too to make the script more general. But in my case, I do not need to consider these markers.

ADD REPLY • link 11.4 years ago by Bioch'Ti ★ 1.1k

0

Entering edit mode

I added a support to MSA/Fasta. See below.

ADD REPLY • link 11.4 years ago by Pierre Lindenbaum 166k

Ram · Answer 1 · 2014-03-05

9

Entering edit mode

11.4 years ago

Pierre Lindenbaum 166k

I quickly wrote something for a CLUSTALW alignment (not fasta). See: http://lindenb.github.io/jvarkit/MsaToVcf.html

It seems to work with the following clustalw input:

$ curl https://raw.github.com/biopython/biopython/master/Tests/Clustalw/opuntia.aln

CLUSTAL W (1.81) multiple sequence alignment


gi|6273285|gb|AF191659.1|AF191      TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273284|gb|AF191658.1|AF191      TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273287|gb|AF191661.1|AF191      TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273286|gb|AF191660.1|AF191      TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273290|gb|AF191664.1|AF191      TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273289|gb|AF191663.1|AF191      TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273291|gb|AF191665.1|AF191      TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
                                    ******* **** *************************************

gi|6273285|gb|AF191659.1|AF191      TATATA----------ATATATTTCAAATTTCCTTATATACCCAAATATA
gi|6273284|gb|AF191658.1|AF191      TATATATA--------ATATATTTCAAATTTCCTTATATACCCAAATATA
gi|6273287|gb|AF191661.1|AF191      TATATA----------ATATATTTCAAATTTCCTTATATATCCAAATATA
gi|6273286|gb|AF191660.1|AF191      TATATA----------ATATATTTATAATTTCCTTATATATCCAAATATA
gi|6273290|gb|AF191664.1|AF191      TATATATATA------ATATATTTCAAATTCCCTTATATATCCAAATATA
gi|6273289|gb|AF191663.1|AF191      TATATATATA------ATATATTTCAAATTCCCTTATATATCCAAATATA
gi|6273291|gb|AF191665.1|AF191      TATATATATATATATAATATATTTCAAATTCCCTTATATATCCAAATATA
                                    ******          ********  **** ********* *********

gi|6273285|gb|AF191659.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCCATTGATTTAGTGT
gi|6273284|gb|AF191658.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273287|gb|AF191661.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273286|gb|AF191660.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273290|gb|AF191664.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
gi|6273289|gb|AF191663.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTAT
gi|6273291|gb|AF191665.1|AF191      AAAATATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGT
                                    ************************************ *********** *

gi|6273285|gb|AF191659.1|AF191      ACCAGA
gi|6273284|gb|AF191658.1|AF191      ACCAGA
gi|6273287|gb|AF191661.1|AF191      ACCAGA
gi|6273286|gb|AF191660.1|AF191      ACCAGA
gi|6273290|gb|AF191664.1|AF191      ACCAGA
gi|6273289|gb|AF191663.1|AF191      ACCAGA
gi|6273291|gb|AF191665.1|AF191      ACCAGA
                                    ******



$ curl https://raw.github.com/biopython/biopython/master/Tests/Clustalw/opuntia.aln" |\
  java -jar dist/biostar94573.jar

##fileformat=VCFv4.1
##Biostar94573CmdLine=
##Biostar94573Version=ca765415946f3ed0827af0773128178bc6aa2f62
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth.">
##contig=<ID=chrUn,length=156>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  gi|6273284|gb|AF191658.1|AF191  gi|6273285|gb|AF191659.1|AF191  gi|6273286|gb|AF191660.1|AF191  gi|6273287|gb|AF191661.1|AF191  gi|6273289|gb|AF191663.1|AF191  gi|6273290|gb|AF191664.1|AF191  gi|6273291|gb|AF191665.1|AF191
chrUn   8   .   T   A   .   .   DP=7    GT:DP   0:1 0:1 1:1 0:1 0:1 0:1 0:1
chrUn   13  .   A   G   .   .   DP=7    GT:DP   0:1 0:1 0:1 0:1 1:1 1:1 1:1
chrUn   56  .   ATATATATATA ATA,A,ATATA .   .   DP=7    GT:DP   1:1 2:1 2:1 2:1 3:1 3:1 0:1
chrUn   74  .   TCA TAT .   .   DP=7    GT:DP   0:1 0:1 1:1 0:1 0:1 0:1 0:1
chrUn   81  .   T   C   .   .   DP=7    GT:DP   0:1 0:1 0:1 0:1 1:1 1:1 1:1
chrUn   91  .   T   C   .   .   DP=7    GT:DP   1:1 1:1 0:1 0:1 0:1 0:1 0:1
chrUn   137 .   T   C   .   .   DP=7    GT:DP   0:1 1:1 0:1 0:1 0:1 0:1 0:1
chrUn   149 .   G   A   .   .   DP=7    GT:DP   0:1 0:1 0:1 0:1 1:1 0:1 0:1

Edit

I added a support for fasta/MSA

$ cat input.msa 
>Ind1
ACGTGGCTAGATCA
>Ind2
ACGTGGCTAGATCA
>Ind3
ACGTGCCTAGATCA

get the output:

$ cat input.msa | java -jar dist/biostar94573.jar 
##fileformat=VCFv4.1
##Biostar94573CmdLine=
##Biostar94573Version=4094eaed3dd2c9364309c20f88c6f79ad54a7450
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth.">
##contig=<ID=chrUn,length=14>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Ind1    Ind2    Ind3
chrUn   6   .   G   C   .   .   DP=3    GT:DP   0:1 0:1 1

ADD COMMENT • link 3.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi Pierre,

I am looking for a program for converting fasta -> vcf and I found this discussion and your code very helpful. However, I have a couple of questions regarding this output -

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Ind1    Ind2    Ind3

chrUn 6 . G C . . DP=3 GT:DP 0:1 0:1 1

Shouldn't the genotype for Ind1 = 0/0 and Ind2 = 0/0 and Ind3 = 1/1? Also does this program allow input as a consensus sequence - for instance, there could be sites such Y and R in the sequence and the vcf could translate these as a A/T or C/G?

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by diviya.smith ▴ 60

0

Entering edit mode

you're right:I've fixed the bug! thank you

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Ind1    Ind2    Ind3 chrUn   6       .       G       C       .       .       DP=3    GT:DP   0/0:1   0/0:1   1/1:1

consensus sequence: no(I don't have time for this)

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thanks! This is very helpful.

ADD REPLY • link 10.7 years ago by diviya.smith ▴ 60

0

Entering edit mode

Hi Pierre, I have many vcf files now, generated courtesy of your script. I only need SNP counts for the purposes of my analyses. Is there a way to get SNP counts (only) for all my vcf files? I thought to use grep, but I noticed I was getting reports with indels, too.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.4 years ago by modernsynthesis ▴ 30

1

Entering edit mode

Use bcftools filter with TYPE="snp" http://samtools.github.io/bcftools/bcftools.html#expressions

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

[filter.c:1298 filters_init1] Error: the tag "INFO/snp" is not defined in the VCF header

error in vcf file generated by your script, when using bcftools filter with TYPE="snp".

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 9.9 years ago by he.zhengshan • 0

0

Entering edit mode

https://vcftools.github.io/man_latest.html

vcftools --vcf input_file.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only

I finally fix it!

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 9.9 years ago by he.zhengshan • 0

0

Entering edit mode

hrUn    14005    .    TCG    TCA,TAG ##actually 2 snp
chrUn    28819    .    TTCA    TGAA,TTCC ##actually 3snp

Could you separate them, not together?

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 9.9 years ago by he.zhengshan • 0

0

Entering edit mode

This can be done with other tools - you want to decompose complex variants into simple variants. I've never tried, but I think e.g. bcftools norm could be used for this job. I think also vcflib should have some tool for this.

ADD REPLY • link 3.4 years ago by jena ▴ 330

0

Entering edit mode

Pierre, many thanks for providing this script, works very well. I used this script for one fasta alignment, but I would like to use it for thousands of fasta alignments. Is there a way to do this separately for each locus all at once?

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by modernsynthesis ▴ 30

0

Entering edit mode

just concatenate

the files with cat?
or merge the vcf later?

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Right, but I think that would be violating homology, no? Desired workflow: fasta > vcf > snp count (for each locus/alignment). Many thanks in advance.

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by modernsynthesis ▴ 30

0

Entering edit mode

Apologies if I am not making any sense. Either way, thank you for providing script.

ADD REPLY • link 10.4 years ago by modernsynthesis ▴ 30

0

Entering edit mode

Should I use a for loop?

ADD REPLY • link 10.4 years ago by modernsynthesis ▴ 30

0

Entering edit mode

Hi Pierre,

Thanks for this! Would it be possible to output EVERY position on the reference as VCF format? (no coverage = ./., and if same as reference = 0/0)? Otherwise I can't merge VCF files of different samples together....

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 9.9 years ago by Coryza ▴ 430

0

Entering edit mode

ask this as an issue at https://github.com/lindenb/jvarkit/issues . Anyway, I away from my sources for a few days.

ADD REPLY • link 9.9 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I think this should work fine with bcftools merge. as long as the VCFs are all called against the same reference.

ADD REPLY • link 3.4 years ago by jena ▴ 330

0

Entering edit mode

I try to run this: cat merged_fasta.aln | java -jar dist/biostar94573.jar but it gives me this: Unable to access jarfile dist/biostar94573.jar

ADD REPLY • link 4.0 years ago by FadyNabil ▴ 20

0

Entering edit mode

Do you have the program from github in the path you specified (i.e. dist/biostar94573.jar)? Github link is in the first post..

ADD REPLY • link 3.4 years ago by jena ▴ 330

0

Entering edit mode

Do you have the program from github in the path you specified

the updated doc is http://lindenb.github.io/jvarkit/MsaToVcf.html

ADD REPLY • link 3.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

And it says deprecated in favour of snp-sites from Sanger-pathogens ;)

Btw, I have a slightly different problem that doesn't work well with the two tools - I have one fasta genome (not alignment) that I need to convert to VCF for merging with my other VCF dataset. Do you know any tool that can convert fasta of one sample to VCF? (It's a chimp mapped & called against hg19, I need it as outgroup in my hg19-based dataset.)

ADD REPLY • link 3.4 years ago by jena ▴ 330

1

Entering edit mode

hard to say without see+understand the input data. You should ask this as a new question with a link to the current question explaining why those tools don't fulfill your needs.

ADD REPLY • link 3.4 years ago by Pierre Lindenbaum 166k

score 2 · Answer 2 · 2014-03-07

2

Entering edit mode

11.4 years ago

Bioch'Ti ★ 1.1k

Hi,

I also found that PGDSpider (http://www.cmpg.unibe.ch/software/PGDSpider/) is able to convert fasta alignment into a vcf file. The problem is that it does not handle concatened alignements in a single fasta file (eg sequences of different sizes).

hope this helps Best, C.

ADD COMMENT • link 11.4 years ago by Bioch'Ti ★ 1.1k

0

Entering edit mode

I was looking into this tool in the past and found I would have to probably merge the fasta into a single long sequence. I could do that, but it sounded insane, so I've found a different solution to my problem at the time (don't remember now what it was). I guess the same holds for my current problem :D :/

ADD REPLY • link 3.4 years ago by jena ▴ 330