The fasta sequence does not match the REF allele at chr1:1298836 ?
1
0
Entering edit mode
2.9 years ago
RobertUt ▴ 40

Dear all,

I am making a consensus.fa from a aligned.bam using bcftools consensus.

cat reference.fa | bcftools consensus calls.vcf.gz > consensus.fa


The generated consensus.fa only contains parts of chr1 and I get the following error:

The site chr1:724953 overlaps with another variant, skipping...
The site chr1:797380 overlaps with another variant, skipping...
The fasta sequence does not match the REF allele at chr1:1298836:
.vcf: [CAGAG]
.vcf: [CAG] <- (ALT)
.fa:  [GAGAG]TTTTGTTCTTTTGCTCAGGATGGAGAGCAGTGGTGCAATC


I double checked: the reference.fa is hg19, and the bam is aligned to hg19, too.

I am appreciating if someone has any suggestions how to solve this issue.

Robert

bcftools vcf consensus • 2.3k views
0
Entering edit mode

Hello RobertUt ,

was the same reference.faused for alignment?

This doesn't explain your problem. But before creating the consensus, I strongly recommend to normalize your vcf file, by using bcftools norm.

fin swimmer

0
Entering edit mode

Hi fin,

might be a curious question: What do you normalize your vcf file for? Because I don't see how normalizing would have helped to solve my problem.

Ut

1
Entering edit mode
2.8 years ago
RobertUt ▴ 40

Problem solved:

The problem seems to occur whenever an indel is overlapping a SNP. Bcftools apparently replaces the REF C (1st line) by the ALT G, and cannot find the C in REF when trying to integrate the indel CAG (2nd line) into the reference. In our dataset there is only indels overlapping SNPs, but no SNPs overlapping another SNP.

#CHROM  POS     ID      REF     ALT     QUAL
chr1    1298836 .       C       G       145
chr1    1298836 .       CAGAG   CAG     39.9116
chr1    1298838 .       G       C       155


Two opportunities:

• skip all indels and genearate consensus.fa without indels.
• given there is only a few overlapping indels (in our case about 1300), the chance of having it in the region of interest is rather low. Therefore, it might be smarter to samtools faidx the region one is interested in and generate consensus for this region. Without overlapping indels in this region bcftools just worked fine.

I'm curious to hear if someone faced the same (similar) problem, but solved it in a different way.

Ut

1
Entering edit mode

Hello RobertUt ,

thanks for sharing. I took a closer look into that and this is what I've found out: Whenever you have two variants of different type with equal starting position, bcftools consensus is throwing the error you've posted. If you have two variants of the same type with equal starting position, bcftools skippes the later one. That there is a different behavior, is in my opinion a bug and should be reported to bcftools.

There is a way to go around this problem. bcftools norm has in option to join multiallelic sites. bcftools consensus have then an option to decide which allel you want to use in the consensus sequence.

I recommend using bcftools normin general, because then you can make sure you have the same variant description between all files. This is useful for annotation, comparison, and so on.

\$ bcftools norm -f genome.fa -m +any -Oz -o output.vcf.gz input.vcf


This would create a normalized version of your vcf and collapse all multiallelic site :

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr1    1298836 .   CAG GAG,C   145 .   .
chr1    1298838 .   G   C   155 .   .


In your bcftools consensus command use then one of values for the -H parameter:

-H, --haplotype <which>    choose which allele to use from the FORMAT/GT field, note
the codes are case-insensitive:
1: first allele from GT
2: second allele
R: REF allele in het genotypes
A: ALT allele
LR,LA: longer allele and REF/ALT if equal length
SR,SA: shorter allele and REF/ALT if equal length


fin swimmer