Question

alignement and Variant calling with BWA and samtools

0

Entering edit mode

2.9 years ago

simon.burgermeister ▴ 30

Hi,

I am trying a variant calling pipeline with BWA and Samtools. Somehow when I have more sequences to align to references the output is less variant and I lose the Indels calls (even though the sequences carrying them are still present in the extended list). Could it be due to default parameters only calling variants if they are represented by a minimum % of the population?

Here is my pipeline:

Index:

bwa index ref_file

Align:

bwa mem  ref_file all_genomes_file > alignement_file.sam

SAM to BAM:

samtools view -S -b s alignement_file.sam > alignement_file.bam

Sort:

samtools sort alignement_file.bam -o sorted.bam

Variant calling:

bcftools mpileup -B -f sorted.bam | bcftools call -mv  --multiallelic-caller --variants-only  > var.raw.vcf

variant-calling samtools BWA • 1.8k views

ADD COMMENT • link updated 20 days ago by Ram 43k • written 2.9 years ago by simon.burgermeister ▴ 30

1

Entering edit mode

the output of bwa can be piped into samtools sort

ADD REPLY • link 2.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Not only that sort can read SAM, so there is no point in converting SAM to BAM first (especially when you're compressing the BAM only to immediately decompress it).

It's sad that this meme of pointless format conversions still exists in workflow blogs all over the place. I wonder how many CPU cycles get wasted.

ADD REPLY • link 2.9 years ago by jkbonfield ★ 1.2k

0

Entering edit mode

True. As a side note don't forget that both commands would share the same cpu, therefore if you have say 8 cores available and you use a command like this one

bwa mem -t 8 | samtools sort -@ 8

then both commands would compete for the same cores. In that case it would be wiser to divide them like this

bwa mem -t 4 | samtools sort -@ 4

ADD REPLY • link 2.9 years ago by Jorge Amigo 14k

score 1 · Accepted Answer · 2021-05-06

1

Entering edit mode

2.9 years ago

Jorge Amigo 14k

The more reads you have, the more reliable the variant calling process is. If using more reads produces less variants, it means that those previously detected variants are not real variants.

From your pipeline I see that you're not removing or at least marking duplicates before calling variants. If your data comes from an hibridization enrichment process this could be a major issue you should address using duplicate removing tools such as Picard, sambamba, samblaster,...

ADD COMMENT • link 2.9 years ago by Jorge Amigo 14k

0

Entering edit mode

Sorry I misspoke: I'm not aligning reads but consensus sequences from individual samples therefore I would like all variat detected to stay, is there any way to do that?

ADD REPLY • link 2.9 years ago by simon.burgermeister ▴ 30