Question: VCF file problem
0
gravatar for lessismore
2.3 years ago by
lessismore660
Mexico
lessismore660 wrote:

Hello everybody,

i have a doubt. i have been analyzing a vcf file and i wanted to check for some of the SNPs if there was correspondence between REF nucleotide in the vcf and the effective nucleotide in the genome from which the SNP was called. In all cases the nucleotide in the specific location i checked was corresponding to the ALT nucleotide in the vcf. Maybe it's a noobie question, but id be very curious to know why (i was expecting a correspondence to REF nucleotide).

Thanks in advance

gbs vcf • 891 views
ADD COMMENTlink modified 2.3 years ago by Manuel Landesfeind1.2k • written 2.3 years ago by lessismore660

how have you checked this ? what was your command line.

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum122k

-> i started from the: genomeassembly file vcf file a list of interesting SNPs (contig and location) that we identified by SVS to have an high FST index:

contig1 76832 
contig2 691177 
contig3 144034 
contig4 190995

-> i extracted the first 5 columns from the vcf for the SNPs

cat snps | grep -w -F -f - vcf_prova.vcf | cut -f1-5

python mygenome = SeqIO.to_dict(SeqIO.parse("genomeassembly.fasta","fasta")) 
contig1 = mygenome ['contig1'] 
print contig1 [76832]

where 76832 is the position of the SNP stored in the vcf

The result i get is that the location i find in the contig is always correspondent to the nucleotide in the ALT column.

ADD REPLYlink modified 2.3 years ago by WouterDeCoster40k • written 2.3 years ago by lessismore660
1
gravatar for Pierre Lindenbaum
2.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

The following cmd line scan the SNVs in a VCF and check the REF allele vs the

$ awk -F '\t' '/^#/ || !($4 ~ /^[ATGC]$/) {next;} {printf("%s,%s,%s\n",$1,$2,$4);}'  mutations.vcf  | while IFS="," read -a A; do echo -n "${A[2]} " && samtools faidx ref.fa "${A[0]}:${A[1]}-${A[1]}"  | grep -v '^>' ; done
A A
A A
T T
T T
C C
A A
T T
A A
A A
G G
T T
T T

(...)

ok REF vcf == REF fasta , to my great relief, people won't have to retract their papers :-)

ADD COMMENTlink written 2.3 years ago by Pierre Lindenbaum122k
1
gravatar for Manuel Landesfeind
2.3 years ago by
Göttingen, Germany
Manuel Landesfeind1.2k wrote:

You can use bcftools norm for checking against a reference:

bcftools norm --fasta-ref "$genome_assembly_file" --check-ref  $ACTION "$vcf_file"

where $ACTION may be

  • w : warn
  • s : try to set/fix bad sites (make sure you verify your VCF file afterwards!)

(for additional actions see the manual)

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Manuel Landesfeind1.2k

what version of bcftools? in my version is not present Version: 0.1.19-96b5f2294a

ADD REPLYlink written 2.3 years ago by lessismore660
2

Well -- I don't know when the norm command or the --check-ref option were introduced but the Github history dates back until 2013. Don't you think your version is a little bit outdated?

ADD REPLYlink written 2.3 years ago by Manuel Landesfeind1.2k

Yes indeed. Thanks i just replaced it.

So the output i got is this for every entry: [W::vcf_parse] contig 'contig1' is not defined in the header. (Quick workaround: index the file with tabix.)

what does that mean?

ADD REPLYlink written 2.3 years ago by lessismore660
1

Your VCF file does not adhere to the VCF specifications. In a proper VCF, every template sequence (i.e., chromosome or contig) must be defined in the header, like exemplified in Section 1.1 of the specification (look for the line starting with ##contig=<ID=20). Apart from your file not being proper VCF, this is a warning and my guess is that it can be ignored.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Manuel Landesfeind1.2k

thanks a lot because im getting things clearer:

they used a reference genome but here in the vcf it says that "Reference allele is not known. The major allele was used as reference allele". So what that means? they did not set a reference genome but still they mapped again a reference and the vcf output in the column REF is imposed by the major allele??

screenshotScreenshot

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by lessismore660
2

I have no idea - but i guess yes: they somehow inferred variants and used the major allele as reference. Probably, you should contact the people that created the VCF file to ask for the methods...

Apart from that: without knowing the exact reference that was used (i.e., having the same FASTA file) I think bcftools norm will not be able to correctly annotated the reference allele.

ADD REPLYlink written 2.3 years ago by Manuel Landesfeind1.2k

yep, i just talked with the person. He created a new VCF from Tassel not defining the reference sequence e choosing the major allele as reference. Solved. Thanks a lot

ADD REPLYlink written 2.3 years ago by lessismore660
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1528 users visited in the last hour