how to get to a VCF from bam files
0
0
Entering edit mode
18 months ago
Human • 0

Hello,

My situation is as follows:

I have two groups of reads/Individuals that differ in terms of indels (one group has the indels, the other doesn't). I already Mapped them and generated bam files. So, now I am struggling to get any form of usable VCF for my purposes. I am a little bit used to FST analysis of PoolSeq reads and the comparison of pupulations in regard to SNPs and indels. In that case I usually went with straight forward mpileup and sync and then used R for the FST analysis and IGV for visualisation of the differences. But here I have several hundred individual's to look at.

What would you suggest? So far I came up with something like this:

    #!/bin/bash

(...)

module load BCFtools/1.3.1
module load SAMtools/1.3.1

# List all BAM files in current directory
bam_files=$(ls *.bam | tr '\n' ' ')

# Create a space-separated string of bam files
BAM_FILES=$(echo $bam_files | tr ' ' '\n' | paste -s -d ' ')

# Run mpileup on all bam files and pipe the output to BCFtools
samtools mpileup -g -B $BAM_FILES | bcftools view -Ou - > all_samples.bcf

Now this resulted in a bcf file which 1. seems to be suspiciously small and 2. I don't really know how to use it. I know I can assess the quality of the variant calls with a tool like bcftools stats to generate summary statistics for the BCF file, such as the number of variants, transition/transversion ratios, and other quality metrics. But is this useful here? In the end I just need a straight forward way to asses the indels of the one group of individuals of my samples. Is pooling here an option too ? Or am I maybe completely wrong ? Thanks

Mpileup Samtools BAM bcftools VCF • 6.3k views
ADD COMMENT
0
Entering edit mode

See: https://samtools.github.io/bcftools/howtos/variant-calling.html

The -O option controls what kind of file you get. If you want regular VCF files then take a look at the inline help. -Oz is what you likely want to use.

ADD REPLY
0
Entering edit mode

now it didn't work at all... could you recommend a script approach and give me some code advise? that would be very helpful :) thanks

ADD REPLY
0
Entering edit mode

what do you think about this?

#!/bin/bash

#SBATCH --partition=mpcb.p
#SBATCH --ntasks=1
#SBATCH --time=0-1:00
#SBATCH --mem-per-cpu=150G
#SBATCH --job-name=VCF
#SBATCH --output="VCF_LOG.txt"
#SBATCH --error="VCF_ERROR.txt"

module load BCFtools/1.3.1
module load SAMtools/1.3.1

# Loop through all bam files in current directory
for bam_file in *.bam
do
    # Get the sample ID from the file name
    sample=$(basename "$bam_file" .bam)

    # Run mpileup and call variants for the current sample
    bcftools mpileup -Ou -f Reference.fasta "$bam_file" | bcftools call -mv -Ob -o "${sample}".bcf

    # Convert BCF to VCF format and index the VCF file
    bcftools view "${sample}".bcf -o "${sample}".vcf
    bcftools index "${sample}".vcf
done
ADD REPLY
0
Entering edit mode

GenoMax is way more experienced than I am and they're helping you so I'd wait a bit longer. BTW, this might be a good time to learn snakemake - your workflow is growing and loops will only get you so far. Snakemake's dry-run option can be super useful.

For example, a simple rule for the code you wrote above would be (not tested, test with snakemake -np <sample_name>.vcf.gz.tbi:

rule bam2vcf:
    input:
        in_bam="{sample}.bam"
    output:
        out_vcf_index="{sample}.vcf.gz.tbi"
    params:
        ref_fasta="/path/to/ref.fa"
    message:
        "Calling variants on {wildcards.sample} BAM"
    shell:
        """
        bcftools mpileup -Ou -f Reference.fasta {input.in_bam} | bcftools call -mv -Oz -o {wildcards.sample}.vcf.gz
        tabix -p vcf {wildcards.sample}.vcf.gz
        """

Save that to Snakefile and then you can simply run:

snakemake -np -s Snakefile <sample_name>.vcf.gz.tbi

Once the dry run looks good, run

snakemake --jobs 1 -s Snakefile <sample_name>.vcf.gz.tbi

in the SLURM script. You can also invoke snakemake with the --slurm option to have snakemake submit the job for you, but let's take things one step at a time.

ADD REPLY
0
Entering edit mode

Thanks! I definitely gonna give it a try. Would you also recommend this for my short term success in terms of this task too? I definitely wanna go deeper in all of this and snakemake looks pretty helpful in the long run, but I would like to get a little forward with this script too. right know I'm getting this errors for all of the samples :

Note: Neither --ploidy nor --ploidy-file given, assuming all sites are diploid
[E::main] unrecognized command 'mpileup'
Failed to open -: unknown file type
[E::hts_open_format] fail to open file '0.sorted.bcf'
Failed to open 0.sorted.bcf: No such file or directory
[E::hts_open_format] fail to open file '0.sorted.vcf'
Failed to read 0.sorted.vcf

This is the script I am using :

#!/bin/bash

#SBATCH (...)

module load BCFtools/1.3.1
module load SAMtools/1.3.1

for bam_file in *.bam
do

    # Get the sample ID from the file name
    sample=$(basename "$bam_file" .bam)

    # Run mpileup and call variants for the current sample
    bcftools mpileup -Ou -f Reference.fasta "$bam_file" | bcftools call -mv -Ob -o "${sample}.bcf"

    # Convert BCF to VCF format and index the VCF file
    bcftools view "${sample}.bcf" -o "${sample}.vcf"
    bcftools index "${sample}.vcf"

done
ADD REPLY
1
Entering edit mode

You are using an ancient version of bcftools (from 2016). Current version is 1.17. Please ask your system admins to upgrade, if that is the only module available. Version numbers on Samtools go 1.2,1.3,1.4...,1.10,1.11 etc. So if you have a module that says samtools/1.17 that is the latest.

It is also not efficient to use a for loop inside a SLURM job. You should use a for loop outside to submit multiple jobs in parallel for all samples.

ADD REPLY
0
Entering edit mode

I agree with GenoMax - you were using array jobs and now you've taken 2 steps backward - a series of jobs submitted using a shell loop is the middle ground.

ADD REPLY
0
Entering edit mode

Thanks a lot GenoMax and Ram! After a short look into that manual https://samtools.github.io/bcftools/howtos/variant-calling.html I went with a pretty straight forward way as they suggested and just did this in a SLURM job without array jobs:

module load BCFtools/1.15.1-GCC-8.3.0 
bcftools mpileup -f Reference.fasta *.bam | bcftools call -mv -Ob -o Test_calls.bcf

It resulted in a pretty small (2KB) .bcf file. While most bam files are between 500 and 1000 KB. I assume it is possible, depending on the amount of variances (?) but still feels suspicious. Now I am wondering if that script might be too simple and misses sth. What do you guys think.

ADD REPLY
0
Entering edit mode

Read my script - change your bcftools to output VCF, not BCF. File size (as long as it's >0) is not an indicator of anything significant.

*.bam is too broad, ensure you use a narrower glob - something that matches a smaller subset of files.

ADD REPLY
0
Entering edit mode

Ok, but in general this approach would work? And why would I not want to use all the bam and Bam.bai files? Why would I subset it?

ADD REPLY
1
Entering edit mode

If you only have BAM files you need in the directory you are working in then *.bam would be ok. Most people working with human samples call variants individually so this feels odd.

If you have several hundred sample files getting a 2kb bcf file indeed sounds suspicious. Are you actually seeing calls in that file?

ADD REPLY
0
Entering edit mode

The approach works, yes, but you're no longer calling per-sample variants. *.bam is dangerous for a beginner - the chances of something going wrong are high and you noticing when it does go wrong are low. That's a dangerous combination. The narrower the glob, the more control you have. Here's an example from a relatively recent mistake I made.

I was using snakemake and I needed to replace all files that had .. in their name with just . - remove a . from xyz..bam for example. My snakemake file had the line for f in *..* do; mv ${f} ${f/../.}; done. I tested it and it went fine and I thought it was all great. This line was the least complicated of 50+ commands and concepts. The other concepts had (in my mind) a higher probability of causing failure so when I started getting "file not found" errors pointing to a possibility in HPC file system latency, I thought one of the major tool commands was failing or snakemake was not playing well with my HPC. Things would work sometimes, other times they would not. There was no single factor I could see that predicated failure. Not even job submission order mattered.

It took a while for me to realize what was going on. I was running a bunch of snakemake jobs in parallel in the same directory, and the mv command was renaming all the .. files it encountered. This meant that the first job to reach the mv line was altering files being worked on by other jobs, and the other jobs were complaining because their files were disappearing. All because my glob wasn't narrow enough. When I added {wildcards.sample} to the glob (for f in {wildcards.sample}*..* do...), everything started running smoothly.

That's why I say that broad globs can trip you up when you least expect it. Go with narrow, predictable globs.

ADD REPLY

Login before adding your answer.

Traffic: 1371 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6