6.0 years ago
marongiu.luigi
Dear all,
Would it be possible to convert a FASTA or GenBank file into a Variant Call Format (VCF) file, or is a GFF the only possible source?
Thank you
This makes no sense as composed. What exactly are you trying to do?
You can take a FASTQ file, a FASTA reference genome and a GFF annotation file to call variants and get VCF files.
FASTA/FASTQ files are read/sequence files
GFF and GTF are annotation files (where the genes, exons, etc. are)
VCF, or Variant Call Format, is the output of variant-calling software
The files you mentioned in your post are completely different
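To make the difference concrete, here is a minimal sketch of what a single VCF data line holds. The eight column names come from the VCF format itself; the example record and the `parse_vcf_line` helper are made up for illustration only.

```python
# The eight fixed, tab-separated columns of a VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict keyed by the fixed column names."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])  # 1-based position on the reference
    return record


# Hypothetical record: an A>G substitution at position 12345 of chr1.
example = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30"
record = parse_vcf_line(example)
print(record["CHROM"], record["POS"], record["REF"], ">", record["ALT"])
```

Every record describes a difference (REF versus ALT) at a position on a reference, which is information that neither a FASTA file nor an annotation file contains on its own.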
If I have a FASTA reference file for organism X and its corresponding GenBank file, would it be possible to generate the VCF? Or is the only way to obtain a VCF to use a GFF file for organism X? I know they are different files; the problem is how to make them. For humans and other selected organisms, these files are already available in the public domain. What happens when I have to make them from scratch?
Not unless you map your data to that reference. I assume you are thinking of using the annotation present in the GenBank file to annotate any variants you find? As far as the actual sequence goes, there is no difference between the two files.
Most traditional variant annotation tools will use a GFF/GTF file but you still need to call variants independently. You do not need a GFF file to call variants, which are commonly stored in VCF files.
Based on your past posts you do this type of work with human data. It should be directly applicable for other species.
If you are looking for already known variants for human, for example, you can take a look at dbSNP. If you want to discover new variants, you will have to "call" these variants using a variant caller.
And to call variants you need aligned sequences, which you have in a BAM file.
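As a rough sketch of that workflow (reads plus reference, then BAM, then VCF), the snippet below only builds the shell commands and prints them; it does not run anything. The choice of bwa, samtools and bcftools is an assumption (any aligner and variant caller would do), and the file names and the `variant_calling_commands` helper are hypothetical.

```python
import shlex


def variant_calling_commands(reference, reads, prefix):
    """Return shell commands for a basic align-then-call workflow.

    Assumes bwa, samtools and bcftools are installed; nothing is executed here.
    """
    ref, fq = shlex.quote(reference), shlex.quote(reads)
    bam = shlex.quote(prefix + ".sorted.bam")
    vcf = shlex.quote(prefix + ".vcf")
    return [
        f"bwa index {ref}",                                # index the reference
        f"bwa mem {ref} {fq} | samtools sort -o {bam} -",  # align and sort reads
        f"samtools index {bam}",                           # index the sorted BAM
        f"bcftools mpileup -f {ref} {bam} | bcftools call -mv -Ov -o {vcf}",
    ]


for command in variant_calling_commands("organismX.fasta", "sample.fastq", "sample"):
    print(command)
```

Note that no GFF or GenBank file appears anywhere in these commands: the annotation only matters later, when annotating the variants that were called.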
Found this about calling variant without reference genome
Thank you for the tip, but this looks like overkill, also because technically I am working with WGS, not RNA-seq. The problem I am facing is: in order to call the variants I first need to re-align the reads to obtain a more accurate picture of the genomic variation; this step is done with the base quality score recalibration (BQSR), but the command -- at least in GATK's implementation -- requires a VCF file to feed the -knownSites/--known-sites option. Thus this looks to me like a circular approach: I need a VCF to generate a VCF. The only things I have so far -- for non-human genomes -- are the FASTA and the GenBank files. So the question is: how can I generate a VCF from these files, if possible?
As far as I know, you don't need to use the VCF of the sample you sequenced. It just requires variants that are common in the population.
This information should have been in your initial post.
I wanted to ask a very general question, independent of the GATK implementation, that is: can I build a VCF from FASTA/GenBank?
Okay, here comes the general answer: no.
Fair enough, case closed.
I have a consensus FASTA file from a de novo genome and the associated GFF file. What tool can I use to convert FASTA to VCF? Bastien, above, said this was possible?
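A VCF only makes sense relative to something else, so a consensus FASTA would have to be compared against a separate reference sequence. As a toy illustration, the sketch below assumes the two sequences are already aligned, have the same length and differ only by substitutions (no indels); real genomes would need a proper alignment step first. The `snps_as_vcf` helper and the sequences are hypothetical.

```python
def snps_as_vcf(chrom, reference, consensus):
    """Return minimal VCF text listing single-base differences.

    Assumes both sequences are pre-aligned and of equal length (no indels).
    """
    if len(reference) != len(consensus):
        raise ValueError("sequences must already be aligned to the same length")
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    pairs = zip(reference.upper(), consensus.upper())
    for index, (ref_base, alt_base) in enumerate(pairs):
        # Skip matching bases and ambiguous positions (N) in either sequence.
        if ref_base != alt_base and "N" not in (ref_base, alt_base):
            # VCF positions are 1-based, hence index + 1.
            lines.append(f"{chrom}\t{index + 1}\t.\t{ref_base}\t{alt_base}\t.\t.\t.")
    return "\n".join(lines) + "\n"


print(snps_as_vcf("contig1", "ACGTACGT", "ACGAACGT"))
```

With only the consensus and its own GFF there is nothing to compare against, so no variant records can be produced.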