how to get to a VCF from bam files
Entering edit mode
7 months ago
Nilo • 0

Hi everyone!

I generated 4 sorted .bam files from paired end data (1 plant per file). When I visualize this in IGV i can clearly see the alignments and the SNPs compared to the reference genome. (you can see in the picture attached:) )

I now want to find SNPS that are linked to a trait of interest in my plants, normally I use TASSEL5 to visualize the .vcf file and do statistics on the SNPs. However, it does not seem to be able to get the 4 bam files into 1 vcf file.

When I use a python script and load the generated vcf file into TASSEL5 it says Error: Index: 0, Size: 0

First I thought that there might be no variants, but since i saw the variants in IGV already i know that they are present. Anyone tips?

Major thanks in advance!


Ps: I now use this code:

import subprocess

# List of input BAM files
bam_files = ["sample1_sorted.bam", "sample2_sorted.bam", "sample3_sorted.bam", "sample4_sorted.bam"]

# Output VCF file name
output_vcf = "output.vcf"

# Define a function to perform variant calling using bcftools
def call_variants(bam_files, output_vcf):
    # Create a command to call variants using bcftools
    bcftools_cmd = [
        "-f", "reference.fasta",  # Replace with your reference genome file
    ] + bam_files + [
        "bcftools", "call",
        "--skip-variants", "indels",
        "-o", output_vcf,

    # Run the bcftools command" ".join(bcftools_cmd), shell=True, check=True)

if __name__ == "__main__":
    # Call variants and generate the VCF file
    call_variants(bam_files, output_vcf)

enter image description here

VCF BAM • 764 views
Entering edit mode

before looking in IGV, can you just see the variant in the VCF file ?

Entering edit mode

no i cant... But, i tried it with more paired end sequences and now it works... Maybe there was too little overlap afterall...

Entering edit mode

only problem now is that i see many variants in the bam files, but very very very few end up in the VCF file, any idea how that comes?

enter image description here

See this picture, how comes that the A and the TT which all seem to be structurally present in the sorted.bam file do not end up in the VCF file?? The blue and cyan blocks above are the variantlocations accoring to the VCF file, screenshot taken from IGV...

Anyone an idea?

Entering edit mode

What you are seeing are not true SNPs but differences between the reads and the reference. If a SNP has a lot of support (i.e. lots of reads with the same base), then the variant caller will more likely call it as a variant. But there is more to calling variants then comparing reads to the reference. Read depth, read balance, mapping scores, per base scores etc etc are all factored in.

You should also see what bcftools is doing, go read the manual on how it calls variants. You are running it on default settings, what are the defaults? Do they make sense for your experience?

Do not look at the BAM file directly, you're just eyeballing things.


Login before adding your answer.

Traffic: 1993 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6