Question: Generation of incorrect heterozygous calls after left normalization using bcftools
nkausthu wrote:

I have the following records in one of my gVCF files:

1       3753032 .       GTTTT   G,GT,GTTT,GTTTTT,GTTTTTT,GTTTTTTTT,<NON_REF>
1       10502954        .       CTTTTT  C,CT,CTTT,CTTTT,CTTTTTT,<NON_REF>
1       11272829        .       T       <NON_REF>
1       11272839        .       G       <NON_REF>
1       15978128        .       T       <NON_REF>
1       15978129        .       T       <NON_REF>
1       38332078        .       T    TCA,TCTCA,TCACACACACACACACACA,TC,<NON_REF>
1       67725648        .       GAAAA   G,GAA,GAAA,GAAAAA,GAAAAAA,<NON_REF>
1       72748277        .       ATT     A,AT,ATTT,ATTTT,ATTTTT,ATTTTTT,<NON_REF>
1       150782110       .       CAAAAA  C,CA,CAA,CAAA,CAAAA,<NON_REF>
1       155724315       .       GTT     G,GT,TTT,GTTTTT,GTTTTTT,<NON_REF>
1       158058266       .       CTTTTTT C,CT,CTT,CTTT,CTTTT,CTTTTT,<NON_REF>
1       201082902       .       C       CAA,CAAA,<NON_REF>
1       212618993       .       A       C,<NON_REF>
1       237955682       .       C       CGTGT,CGTGTGT,<NON_REF>
2       27532239        .       CAAA    C,CA,CAA,CAAAAAAAAAAAAAAAAA,<NON_REF>
2       47641559        .       TAAAAAA T,TA,TAA,TAAA,TAAAA,TAAAAA,<NON_REF>
2       100058714       .       CAA     C,CA,CAAA,CAAAA,CAAAAAAA,CAAAAAAAA,<NON_REF>
2       113303450       .       T       <NON_REF>
2       113303451       .       G       <NON_REF>
2       207998878       .       AT      A,ATT,ATTT,ATTTT,ATTTTT,ATTTTTT,<NON_REF>
2       231333532       .       CAAA    C,CAA,CAAAAAAA,<NON_REF>
3       42734487        .       G       <NON_REF>
3       42734750        .       C       A,<NON_REF>
3       42734751        .       C       <NON_REF>
3       47484723        .       TACACACAC       T,TAC,<NON_REF>

When we ran left normalization with bcftools after joint genotyping, many false heterozygous calls were generated with no reads supporting the alternate allele, for example:

0/1:14,0:42:72:149,0,164

0/1:10,0:35:17:88,0,111

I guess this is due to incorrect splitting of multi-allelic sites. Could anyone suggest ways to remove these variants from the downstream VCF file?

modified 24 days ago by Kevin Blighe • written 5 weeks ago by nkausthu

Hello,

please provide an example dataset one can use directly for testing.

Thanks!

fin swimmer

written 5 weeks ago by finswimmer

I can provide the VCF file after left normalization. Is that sufficient?

written 5 weeks ago by nkausthu

Hello,

That's better than nothing, but the input VCF would be more useful. Reduce it to a few example lines that show your problem.

written 5 weeks ago by finswimmer
Kevin Blighe wrote:

Hey,

With multi-allelic sites, I think it is better to do this as a two-step process, and you must also use the reference genome against which the variants were originally called.

So, something like this:

# 1st pipe, splits multi-allelic calls into separate variant calls
# 2nd pipe, left-aligns indels and issues warnings when the REF base in your VCF does not match the base in the supplied FASTA reference genome
bcftools norm -m-any myvariants.vcf | \
  bcftools norm -Ob --check-ref w -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta \
  > myvariants_norm.bcf ;

Kevin

modified 24 days ago • written 24 days ago by Kevin Blighe
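
On the original request (removing heterozygous calls that have no reads supporting the alternate allele), one possible approach is to mask such genotypes after normalization. The following is only a minimal sketch, not something proposed in the thread: `mask_unsupported_gt` is a hypothetical helper written in plain awk, it expects uncompressed tab-delimited VCF text on stdin, and it assumes the FORMAT column contains GT and AD with AD listing the REF depth first, followed by one depth per ALT allele. Any genotype that calls an ALT allele whose AD value is 0 is set to ./. and everything else is passed through unchanged.

```shell
# Hypothetical helper (not part of bcftools): reads uncompressed VCF text on stdin
# and sets a genotype to ./. when a called ALT allele has zero supporting reads in AD.
mask_unsupported_gt() {
  awk 'BEGIN { FS = OFS = "\t" }
  /^#/ { print; next }                        # pass header lines through untouched
  {
    # locate the GT and AD positions in the FORMAT column
    nfmt = split($9, fmt, ":")
    gt = 0; ad = 0
    for (i = 1; i <= nfmt; i++) {
      if (fmt[i] == "GT") gt = i
      if (fmt[i] == "AD") ad = i
    }
    if (gt && ad) {
      for (s = 10; s <= NF; s++) {            # sample columns start at column 10
        n = split($s, f, ":")
        if (n < ad) continue                  # trailing fields dropped, no AD for this sample
        na = split(f[gt], allele, "[/|]")
        split(f[ad], depth, ",")
        bad = 0
        for (a = 1; a <= na; a++) {
          # allele index k (k > 0 means ALT) maps to depth[k + 1]
          if (allele[a] != "." && allele[a] > 0 && ((allele[a] + 1) in depth) && depth[allele[a] + 1] == 0) bad = 1
        }
        if (bad) {
          f[gt] = "./."
          out = f[1]
          for (i = 2; i <= n; i++) out = out ":" f[i]
          $s = out
        }
      }
    }
    print
  }'
}
```

It could be used along the lines of `bcftools view myvariants_norm.bcf | mask_unsupported_gt | bcftools view -Ob -o myvariants_masked.bcf`, where the output file name is a placeholder. Masking the genotype rather than deleting the whole record keeps other samples at the same site intact. Whether zero ALT depth is the right cutoff for your data is a judgment call, so inspect a few masked sites before relying on the result.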