Question

Vcf normalization dificulties

0

Entering edit mode

3.7 years ago

Romeo.Dervishaj • 0

Greetings to all of you, I have landed in the wastelands of bioinformatics for my master thesis. As a beginner I am having some issues to pick up the speed, I was assigned to analyse 12 vcf files produced from pacbio and nanopore sequences thru Sniffles, they contain mostly indels. I need to merge them so that I can analyse the distribution of the indels among those files thru intersect. Currently, I am trying to normalise them, to split the reads I used bcftools norm -m-both, and it gave a correct output as below:

   ##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##source=Sniffles
##fileDate=20200704
##contig=<ID=chrA01,length=23267856>
##contig=<ID=chrA02,length=24793737>
##contig=<ID=chrA03,length=29767490>
##contig=<ID=chrA04,length=19151660>
##contig=<ID=chrA05,length=23067598>
##contig=<ID=chrA06,length=24396386>
##contig=<ID=chrA07,length=24006521>
##contig=<ID=chrA08,length=18961941>
##contig=<ID=chrA09,length=33865340>
##contig=<ID=chrA10,length=17398227>
##contig=<ID=chrC01,length=38829317>
##contig=<ID=chrC02,length=46221804>
##contig=<ID=chrC03,length=60573394>
##contig=<ID=chrC04,length=48930237>
##contig=<ID=chrC05,length=43185227>
##contig=<ID=chrC06,length=37225952>
##contig=<ID=chrC07,length=44770477>
##contig=<ID=chrC08,length=38477087>
##contig=<ID=chrC09,length=48508220>
##contig=<ID=chrA01_random,length=2687285>
##contig=<ID=chrA02_random,length=1623597>
##contig=<ID=chrA03_random,length=6015112>
##contig=<ID=chrA04_random,length=1481564>
##contig=<ID=chrA05_random,length=3017581>
##contig=<ID=chrA06_random,length=2286780>
##contig=<ID=chrA07_random,length=2069713>
##contig=<ID=chrA08_random,length=2120946>
##contig=<ID=chrA09_random,length=4134372>
##contig=<ID=chrA10_random,length=2278402>
##contig=<ID=chrC01_random,length=4437620>
##contig=<ID=chrC02_random,length=5147596>
##contig=<ID=chrC03_random,length=6509817>
##contig=<ID=chrC04_random,length=4434705>
##contig=<ID=chrC05_random,length=3728275>
##contig=<ID=chrC06_random,length=3366307>
##contig=<ID=chrC07_random,length=2949052>
##contig=<ID=chrC08_random,length=4518928>
##contig=<ID=chrC09_random,length=4437882>
##contig=<ID=chrAnn_random,length=48658326>
##contig=<ID=chrCnn_random,length=80671874>
##contig=<ID=chrUnn_random,length=8317898>
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=INVDUP,Description="InvertedDUP with unknown boundaries">
##ALT=<ID=TRA,Description="Translocation">
##ALT=<ID=INS,Description="Insertion">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate in case of a translocation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=MAPQ,Number=1,Type=Integer,Description="Median mapping quality of paired-ends">
##INFO=<ID=RE,Number=1,Type=Integer,Description="read support">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise structural variation">
##INFO=<ID=UNRESOLVED,Number=0,Type=Flag,Description="An insertion that is longer than the read and thus we cannot predict the full size.">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of the SV">
##INFO=<ID=REF_strand,Number=2,Type=Integer,Description="Length of the SV">
##INFO=<ID=SVMETHOD,Number=1,Type=String,Description="Type of approach used to detect SV">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SEQ,Number=1,Type=String,Description="Extracted sequence from the best representative read.">
##INFO=<ID=STD_quant_start,Number=A,Type=Float,Description="STD of the start breakpoints across the reads.">
##INFO=<ID=STD_quant_stop,Number=A,Type=Float,Description="STD of the stop breakpoints across the reads.">
##INFO=<ID=Kurtosis_quant_start,Number=A,Type=Float,Description="Kurtosis value of the start breakpoints across the reads.">
##INFO=<ID=Kurtosis_quant_stop,Number=A,Type=Float,Description="Kurtosis value of the stop breakpoints across the reads.">
##INFO=<ID=SUPTYPE,Number=A,Type=String,Description="Type by which the variant is supported.(SR,ALN,NR)">
##INFO=<ID=STRANDS,Number=A,Type=String,Description="Strand orientation of the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--)">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency.">
##INFO=<ID=ZMW,Number=A,Type=Integer,Description="Number of ZMWs (Pacbio) supporting SV.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DR,Number=1,Type=Integer,Description="# high-quality reference reads">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality variant reads">
##bcftools_normVersion=1.9+htslib-1.9
##bcftools_normCommand=norm -m-both -o /home/mycomputer/Scrivania/master/thesis/mycomputer_vcf/mytests/normalised/output_normalized_ns.vcf.gz.vcf input_ns.vcf.gz; Date=Sat Jul 25 18:52:49 2020
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  data_ONT_bn41.bam
chrA01  33317   0   TTATTATTAAATACTTAATAAATTAAGACTTTTGTTTAGAGAATAATTGTTACTTTTATGATAATATTTAATTATTTATACGTACACATGAGAATGCTATTAATATAAAATACATTATTTATGATGGTTGACGTAAGGTCCAAATTTAATTCACTAAATGATATTTTGACAGAAGAGAAAAGGAAATCAGGTTTAAACTTATATTATAA   N   .   PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=chrA01;END=33521;STD_quant_start=31.9857;STD_quant_stop=30.7147;Kurtosis_quant_start=-2.08827;Kurtosis_quant_stop=-0.295932;SVTYPE=DEL;SUPTYPE=AL,SR;SVLEN=-204;STRANDS=+-;RE=46    GT:DR:DV    ./.:.:46
chrA01  84035   1   N   GATTTTGACTCATGGTGTCACACAGAGACATTAAAAATGAGCATTTTTCATAATTATTGCACTTAAAACGAACTGAGGGAATATAACTTATAAATTTATCCAAAGATATAGTAAAATTAAAACCTAAAACTATAAATGATTTAAATATTCTAATTATTCAAAATTCTTTATAATGAGAACTCAAAATCTGATCTGGAACCGAAGTTTTTTTTTATATTTGGGAAAGTATCCTAATCTCTATCTAAAAATAGATTTATATACTTATATATATTAATTATTTTTAGATTTAATTTATATAAAACATCAAAAATGATACCAAACTATTTCAAAATTACTTGAAAATATATAGATAGTCAAAGTAAATATCTGAAATACAAAGTATACCTAAACCAAAAATACTTAAAATAATCATTGACTGCATTCCAAAATTTTAAAGCAAGCTTCAATCGATATGTTGTTTAAGTATTACATAAGTATTCAAATTTATACTTAATATATTATTTGCCTATGATTTGGAGAAATTAAAAATATATGTCGATTTAAGACTTTAAAAATAATTTAAATAGCTTATCCAACCCAAACCAAACCTAAAGATCCTAAACCAAC  .   PASS    PRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=chrA01;END=84646;STD_quant_start=0.559017;STD_quant_stop=7.18505;Kurtosis_quant_start=3.86025;Kurtosis_quant_stop=4.34983;SVTYPE=INS;SUPTYPE=AL,SR;SVLEN=617;STRANDS=+-;RE=32 GT:DR:DV    ./.:.:32

When I try to left-align my indels by providing the genome with the following command, it produces only the header.

bcftools norm -Ou -m -any -f ~/Scrivania/master/thesis/mycomputer_vcf/02_filtered_vcf/Brassica_napus_v4.1.fasta -o ~/Scrivania/master/thesis/mycomputer_vcf/mytests/normalised/left-aligned/output_left-align_ns.vcf output_normalized_ns.vcf.gz.vcf

    ##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##source=Sniffles
##fileDate=20200704
##contig=<ID=chrA01,length=23267856>
##contig=<ID=chrA02,length=24793737>
##contig=<ID=chrA03,length=29767490>
##contig=<ID=chrA04,length=19151660>
##contig=<ID=chrA05,length=23067598>
##contig=<ID=chrA06,length=24396386>
##contig=<ID=chrA07,length=24006521>
##contig=<ID=chrA08,length=18961941>
##contig=<ID=chrA09,length=33865340>
##contig=<ID=chrA10,length=17398227>
##contig=<ID=chrC01,length=38829317>
##contig=<ID=chrC02,length=46221804>
##contig=<ID=chrC03,length=60573394>
##contig=<ID=chrC04,length=48930237>
##contig=<ID=chrC05,length=43185227>
##contig=<ID=chrC06,length=37225952>
##contig=<ID=chrC07,length=44770477>
##contig=<ID=chrC08,length=38477087>
##contig=<ID=chrC09,length=48508220>
##contig=<ID=chrA01_random,length=2687285>
##contig=<ID=chrA02_random,length=1623597>
##contig=<ID=chrA03_random,length=6015112>
##contig=<ID=chrA04_random,length=1481564>
##contig=<ID=chrA05_random,length=3017581>
##contig=<ID=chrA06_random,length=2286780>
##contig=<ID=chrA07_random,length=2069713>
##contig=<ID=chrA08_random,length=2120946>
##contig=<ID=chrA09_random,length=4134372>
##contig=<ID=chrA10_random,length=2278402>
##contig=<ID=chrC01_random,length=4437620>
##contig=<ID=chrC02_random,length=5147596>
##contig=<ID=chrC03_random,length=6509817>
##contig=<ID=chrC04_random,length=4434705>
##contig=<ID=chrC05_random,length=3728275>
##contig=<ID=chrC06_random,length=3366307>
##contig=<ID=chrC07_random,length=2949052>
##contig=<ID=chrC08_random,length=4518928>
##contig=<ID=chrC09_random,length=4437882>
##contig=<ID=chrAnn_random,length=48658326>
##contig=<ID=chrCnn_random,length=80671874>
##contig=<ID=chrUnn_random,length=8317898>
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=INVDUP,Description="InvertedDUP with unknown boundaries">
##ALT=<ID=TRA,Description="Translocation">
##ALT=<ID=INS,Description="Insertion">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate in case of a translocation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=MAPQ,Number=1,Type=Integer,Description="Median mapping quality of paired-ends">
##INFO=<ID=RE,Number=1,Type=Integer,Description="read support">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise structural variation">
##INFO=<ID=UNRESOLVED,Number=0,Type=Flag,Description="An insertion that is longer than the read and thus we cannot predict the full size.">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of the SV">
##INFO=<ID=REF_strand,Number=2,Type=Integer,Description="Length of the SV">
##INFO=<ID=SVMETHOD,Number=1,Type=String,Description="Type of approach used to detect SV">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SEQ,Number=1,Type=String,Description="Extracted sequence from the best representative read.">
##INFO=<ID=STD_quant_start,Number=A,Type=Float,Description="STD of the start breakpoints across the reads.">
##INFO=<ID=STD_quant_stop,Number=A,Type=Float,Description="STD of the stop breakpoints across the reads.">
##INFO=<ID=Kurtosis_quant_start,Number=A,Type=Float,Description="Kurtosis value of the start breakpoints across the reads.">
##INFO=<ID=Kurtosis_quant_stop,Number=A,Type=Float,Description="Kurtosis value of the stop breakpoints across the reads.">
##INFO=<ID=SUPTYPE,Number=A,Type=String,Description="Type by which the variant is supported.(SR,ALN,NR)">
##INFO=<ID=STRANDS,Number=A,Type=String,Description="Strand orientation of the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--)">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency.">
##INFO=<ID=ZMW,Number=A,Type=Integer,Description="Number of ZMWs (Pacbio) supporting SV.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DR,Number=1,Type=Integer,Description="# high-quality reference reads">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality variant reads">
##bcftools_normVersion=1.9+htslib-1.9
##bcftools_normCommand=norm -m-both -o /home/mycomputer/Scrivania/master/thesis/mycomputer_vcf/mytests/normalised/input_ns.vcf.gz.vcf data_ns.vcf.gz; Date=Sat Jul 25 18:52:49 2020
##bcftools_normCommand=norm -f /home/mycomputer/Scrivania/master/thesis/mycomputer_vcf/02_filtered_vcf/Brassica_napus_v4.1.fasta  -o ~/Scrivania/master/thesis/mycomputer_vcf/mytests/normalised/left-aligned/output_left-align_ns.vcf output_normalized_ns.vcf.gz.vcf; Date=Mon Jul 27 13:29:57 2020
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  data_ONT_bn41.bam

and in the terminal, it generates a warning Reference allele mismatch at chrA01:33317 .. REF_SEQ:'TAAATACTTAATAAATTAAGACTTTTGTTTAGAGAATAATTGTTACTTTTATGATAATATTTAATTATTTATACGTACACATGAGAATGCTATTAATATAAAATACATTATTTATGATGGTTGACGTAAGGTCCAAATTTAATTCACTAAATGATATTTTGACAGAAGAGAAAAGGAAATCAGGTTTAAACTTATATTATAAATTATTA' vs VCF:'TTATTATTAAATACTTAATAAATTAAGACTTTTGTTTAGAGAATAATTGTTACTTTTATGATAATATTTAATTATTTATACGTACACATGAGAATGCTATTAATATAAAATACATTATTTATGATGGTTGACGTAAGGTCCAAATTTAATTCACTAAATGATATTTTGACAGAAGAGAAAAGGAAATCAGGTTTAAACTTATATTATAA

I tried to use vt_normalize as well, and it still gave the same terminal error. I would very much appreciate some help, So thank you in advance for whoever is spending some time reading this thread. Regards Romeo

P.S. I have changed the names of the files and folders to make them as generic as possible Sorry if this may create any confusion

SNP sequence software error genome • 1.4k views

ADD COMMENT • link updated 3.7 years ago by WouterDeCoster 47k • written 3.7 years ago by Romeo.Dervishaj • 0

score 3 · Answer 1 · 2020-07-27

3

Entering edit mode

3.7 years ago

WouterDeCoster 47k

The error message indicates that the sequence that is specified in the VCF for chrA01:33317 does not match the reference at those coordinates. Make sure that the reference genome you are using is the one used for alignment. If that's the case, Sniffles used to have an off-by-one error in its coordinates until recently. That could be the problem. Try to see if the sequence reported in the VCF (in the REF column) matches the reference genome, and if isn't, see if it's shifted by one nucleotide.

ADD COMMENT • link 3.7 years ago by WouterDeCoster 47k

0

Entering edit mode

Hello, first of all, thank you for the quick reply. I appreciate very much your effort into helping me. The genome is the same used for the alignments; I opened the reference genome with igv. I browsed to the position (Sorry but don't know a method yet to extract the data from the fasta files by position number, probably by using grep could be done but in this case was the fastest alternative). Yes, the ref column and the ref genome start both at 33317 with TAAATAC but my sequence starts (33310) TTATTATTAAATACTTAATAA (33330). the data is from https://www.biorxiv.org/content/10.1101/2020.01.27.915470v1 this paper. screenshot of the vcf opened in LibreOffice calc . It may be, as you pointed, fault of the off-by-one error since the data was produced before the fix of the bug. in this case, can I edit the vcf file or is better if I just call the variants again with the updated Snfiffles from the bam files? Thank you again for your time. Is indeed a huge help.

Update: bcftools norm --check-ref e -f genome.fa input.vcf

Reference allele mismatch at chrA01:33317 .. REF_SEQ:'TAAATACTTAATAAATTAAGACTTTTGTTTAGAGAATAATTGTTACTTTTATGATAATATTTAATTATTTATACGTACACATGAGAATGCTATTAATATAAAATACATTATTTATGATGGTTGACGTAAGGTCCAAATTTAATTCACTAAATGATATTTTGACAGAAGAGAAAAGGAAATCAGGTTTAAACTTATATTATAAATTATTA' vs VCF:'(TTATTAT)TAAATACTTAATAAATTAAGACTTTTGTTTAGAGAATAATTGTTACTTTTATGATAATATTTAATTATTTATACGTACACATGAGAATGCTATTAATATAAAATACATTATTTATGATGGTTGACGTAAGGTCCAAATTTAATTCACTAAATGATATTTTGACAGAAGAGAAAAGGAAATCAGGTTTAAACTTATATTATAA'

it seams off by more than one base

ADD REPLY • link 3.7 years ago by Romeo.Dervishaj • 0