Question: The fasta sequence does not match the REF allele at chr1:1298836 ?
0
gravatar for RobertUt
5 weeks ago by
RobertUt10
RobertUt10 wrote:

Dear all,

I am making a consensus.fa from a aligned.bam using bcftools consensus.

cat reference.fa | bcftools consensus calls.vcf.gz > consensus.fa

The generated consensus.fa only contains parts of chr1 and I get the following error:

The site chr1:724953 overlaps with another variant, skipping...
    The site chr1:797380 overlaps with another variant, skipping...
    The fasta sequence does not match the REF allele at chr1:1298836:
       .vcf: [CAGAG]
       .vcf: [CAG] <- (ALT)
       .fa:  [GAGAG]TTTTGTTCTTTTGCTCAGGATGGAGAGCAGTGGTGCAATC

I double checked: the reference.fa is hg19, and the bam is aligned to hg19, too.

I am appreciating if someone has any suggestions how to solve this issue.

Robert

bcftools consensus vcf • 135 views
ADD COMMENTlink modified 4 weeks ago • written 5 weeks ago by RobertUt10

Hello RobertUt ,

was the same reference.faused for alignment?

This doesn't explain your problem. But before creating the consensus, I strongly recommend to normalize your vcf file, by using bcftools norm.

fin swimmer

ADD REPLYlink written 5 weeks ago by finswimmer8.9k

Hi fin,

might be a curious question: What do you normalize your vcf file for? Because I don't see how normalizing would have helped to solve my problem.

Ut

ADD REPLYlink written 4 weeks ago by RobertUt10
1
gravatar for RobertUt
4 weeks ago by
RobertUt10
RobertUt10 wrote:

Problem solved:

The problem seems to occur whenever an indel is overlapping a SNP. Bcftools apparently replaces the REF C (1st line) by the ALT G, and cannot find the C in REF when trying to integrate the indel CAG (2nd line) into the reference. In our dataset there is only indels overlapping SNPs, but no SNPs overlapping another SNP.

#CHROM  POS     ID      REF     ALT     QUAL    
chr1    1298836 .       C       G       145     
chr1    1298836 .       CAGAG   CAG     39.9116 
chr1    1298838 .       G       C       155

Two opportunities:

  • skip all indels and genearate consensus.fa without indels.
  • given there is only a few overlapping indels (in our case about 1300), the chance of having it in the region of interest is rather low. Therefore, it might be smarter to samtools faidx the region one is interested in and generate consensus for this region. Without overlapping indels in this region bcftools just worked fine.

I'm curious to hear if someone faced the same (similar) problem, but solved it in a different way.

Ut

ADD COMMENTlink written 4 weeks ago by RobertUt10
1

Hello RobertUt ,

thanks for sharing. I took a closer look into that and this is what I've found out: Whenever you have two variants of different type with equal starting position, bcftools consensus is throwing the error you've posted. If you have two variants of the same type with equal starting position, bcftools skippes the later one. That there is a different behavior, is in my opinion a bug and should be reported to bcftools.

There is a way to go around this problem. bcftools norm has in option to join multiallelic sites. bcftools consensus have then an option to decide which allel you want to use in the consensus sequence.

I recommend using bcftools normin general, because then you can make sure you have the same variant description between all files. This is useful for annotation, comparison, and so on.

$ bcftools norm -f genome.fa -m +any -Oz -o output.vcf.gz input.vcf

This would create a normalized version of your vcf and collapse all multiallelic site :

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr1    1298836 .   CAG GAG,C   145 .   .
chr1    1298838 .   G   C   155 .   .

In your bcftools consensus command use then one of values for the -H parameter:

-H, --haplotype <which>    choose which allele to use from the FORMAT/GT field, note
                           the codes are case-insensitive:
                               1: first allele from GT
                               2: second allele
                               R: REF allele in het genotypes
                               A: ALT allele
                               LR,LA: longer allele and REF/ALT if equal length
                               SR,SA: shorter allele and REF/ALT if equal length

fin swimmer

ADD REPLYlink written 4 weeks ago by finswimmer8.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 790 users visited in the last hour